May 25, 2013
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
> int countSpaces(const(char)* c)
> {
>     int n = 0;
>     while (*c)
>         if (*c == ' ')
>             ++n;
>     return n;
> }

Oops. Missing a ++c in there, but I'm sure the point was made :-)
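For the record, with the missing increment added it would be (still no UTF-8 decoding anywhere):

int countSpaces(const(char)* c)
{
    int n = 0;
    for (; *c; ++c)      // walk one code unit (byte) at a time
        if (*c == ' ')   // 0x20 never occurs inside a multi-byte sequence
            ++n;
    return n;
}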
May 25, 2013
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
>> On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>>>> If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless.
>>> Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
>>
>> No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly?  Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above.  Obviously the constant-width encoding will be faster.  Did I really need to explain this?

It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need to perform UTF-8 decoding.

I hope this clears up the misunderstanding :)
May 25, 2013
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
> >On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
> >>>If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless.
> >>Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
> >
> >No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly?  Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above.  Obviously the constant-width encoding will be faster.  Did I really need to explain this?
[...]

Have you actually tried to write a whitespace splitter for UTF-8? Do you realize that you can use an ASCII whitespace splitter for UTF-8 and it will work correctly?

There is no need to decode UTF-8 for whitespace splitting at all. There is no need to parse anything. You just iterate over the bytes and split on 0x20. There is no performance difference over ASCII.
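A minimal sketch of what I mean (splitting on ASCII space/tab/CR/LF only; the function name is just for illustration):

string[] splitAsciiWhite(string s)
{
    string[] parts;
    size_t start = 0;
    foreach (i, char b; s)                  // iterates code units (bytes), no decoding
    {
        if (b == ' ' || b == '\t' || b == '\r' || b == '\n')
        {
            if (i > start)
                parts ~= s[start .. i];     // always a valid slice: these byte values
                                            // cannot occur inside a multi-byte sequence
            start = i + 1;
        }
    }
    if (start < s.length)
        parts ~= s[start .. $];
    return parts;
}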

As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code tries to play it safe by decoding every character, this is not necessary in many cases.


T

-- 
The best compiler is between your ears. -- Michael Abrash
May 25, 2013
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev wrote:
> On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
>> Are you sure _you_ understand it properly?  Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above.  Obviously the constant-width encoding will be faster.  Did I really need to explain this?
>
> It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need to perform UTF-8 decoding.
>
> I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not necessary to decode every UTF-8 character if you are simply comparing against ASCII space characters.  My mixup was because I was unaware whether every language uses its own space character in UTF-8 or whether they all reuse the ASCII space character; apparently it's the latter.

However, my overall point stands.  You still have to check 2-4 times as many bytes if you do it the way Peter suggests, as opposed to a single-byte encoding.  There is a shortcut: you could also check the first byte to see if it's ASCII or not and then skip the right number of ensuing bytes in a character's encoding if it isn't ASCII, but at that point you have begun partially decoding the UTF-8 encoding, which you claimed wasn't necessary and which will degrade performance anyway.
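In code, that shortcut would look roughly like this (a rough sketch only, assuming well-formed UTF-8; the function name is made up):

int countSpacesStriding(const(char)[] s)
{
    int n = 0;
    for (size_t i = 0; i < s.length; )
    {
        char b = s[i];
        if (b < 0x80)                // ASCII byte: compare directly
        {
            if (b == ' ')
                ++n;
            ++i;
        }
        else if (b >= 0xF0) i += 4;  // lead byte of a 4-byte sequence
        else if (b >= 0xE0) i += 3;  // lead byte of a 3-byte sequence
        else                i += 2;  // lead byte of a 2-byte sequence
    }
    return n;
}

That's the partial decoding I'm referring to: you still have to inspect the lead byte of every non-ASCII character before you can skip it.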

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
> I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding UTF-8.

> This code will count all spaces in a string whether it is encoded as ASCII or UTF-8:
>
> int countSpaces(const(char)* c)
> {
>     int n = 0;
>     while (*c)
>         if (*c == ' ')
>             ++n;
>     return n;
> }
>
> I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is because UTF-8 is self-synchronising.
Not quite.  The reason you don't need to decode is because of the particular encoding scheme chosen for UTF-8, a side effect of ASCII backwards compatibility and reusing the ASCII space character; it has nothing to do with whether it's self-synchronizing or not.

> The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character," it works the same for any single ASCII character.

Of course it's slower than a fixed-width single-byte encoding.  You have to check every single byte of a non-ASCII character in UTF-8, whereas a single-byte encoding only has to check a single byte per language character.  There is a shortcut if you partially decode the first byte in UTF-8, mentioned above, but you seem dead-set against decoding. ;)

> Again, I urge you, please read up on UTF-8. It is very well designed.
I disagree.  It is very badly designed, but the ASCII compatibility does hack in some shortcuts like this, which still don't salvage its performance.
May 25, 2013
"Manu" <turkeyman@gmail.com> wrote in message news:mailman.137.1369448229.13711.digitalmars-d@puremagic.com...
>>
>> One of the first, and best, decisions I made for D was it would be
>> Unicode
>> front to back.
>>
>
> Indeed, excellent decision!
> So when we define operators for u × v and a · b, or maybe n²? ;)

When these have keys on standard keyboards.


May 25, 2013
25-May-2013 10:44, Joakim wrote:
> On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
>> You seem to think that not only UTF-8 is bad encoding but also one
>> unified encoding (code-space) is bad(?).
> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
> on the code space.  I was originally going to title my post, "Why
> Unicode?" but I have no real problem with UCS, which merely standardized
> a bunch of pre-existing code pages.  Perhaps there are a lot of problems
> with UCS also, I just haven't delved into it enough to know.  My problem
> is with these dumb variable-length encodings, so I was precise in the
> title.
>

UCS is dead and gone. Next in line to "640K is enough for everyone".
Simply put, Unicode decided to take into account the full diversity of languages instead of ~80% of them. Hard to add anything else. No offense meant, but it feels like you are living in a universe that is 5-7 years behind the current state of things. UTF-16 (a successor to UCS) is no random-access either. And it's shitty beyond measure; UTF-8 is a shining gem in comparison.

>> Separate code spaces were the case before Unicode (and utf-8). The
>> problem is not only that without header text is meaningless (no easy
>> slicing) but the fact that encoding of data after header strongly
>> depends a variety of factors -  a list of encodings actually. Now
>> everybody has to keep a (code) page per language to at least know if
>> it's 2 bytes per char or 1 byte per char or whatever. And you still
>> work on a basis that there is no combining marks and regional specific
>> stuff :)
> Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs that indicate that a few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.

>  Does
> UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
> char or whatever?"

It's coherent in its scheme to determine that. You don't need extra information synced to the text, unlike header stuff.

> It has to do that also. Everyone keeps talking about
> "easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
> turns UTF-8 into UTF-32 internally for all that ease of use, at least
> doubling your string size in the process.  Correct me if I'm wrong, that
> was what I read on the newsgroup sometime back.

Indeed you are - searching for a UTF-8 substring in a UTF-8 string doesn't do any decoding, and it does return you a slice of the original (the balance starting at the match).
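To illustrate the principle, a naive byte-wise search - just a sketch, the function name is made up:

// Returns the balance of the original string starting at the match,
// or an empty slice at the end if there is no match. No decoding anywhere.
inout(char)[] findSub(inout(char)[] haystack, const(char)[] needle)
{
    if (needle.length > haystack.length)
        return haystack[$ .. $];
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return haystack[i .. $];   // a slice into the original, never re-encoded
    return haystack[$ .. $];
}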

>
>> In fact it was even "better" nobody ever talked about header they just
>> assumed a codepage with some global setting. Imagine yourself creating
>> a font rendering system these days - a hell of an exercise in
>> frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
>> then ...).
> I understand that people were frustrated with all the code pages out
> there before UCS standardized them, but that is a completely different
> argument than my problem with UTF-8 and variable-length encodings.  My
> proposed simple, header-based, constant-width encoding could be
> implemented with UCS and there go all your arguments about random code
> pages.

No, they don't - have you ever seen native Korean or Chinese codepages? The problems with your header-based approach are self-evident, in the sense that there is no single sane way to deal with it on a cross-locale basis (which you simply ignore, as noted below).

>> This just shows you don't care for multilingual stuff at all. Imagine
>> any language tutor/translator/dictionary on the Web. For instance most
>> languages need to intersperse ASCII (also keep in mind e.g. HTML
>> markup). Books often feature citations in native language (or e.g.
>> latin) along with translations.
> This is a small segment of use and it would be handled fine by an
> alternate encoding.

??? That simply makes no sense. Some legacy encodings have no intersection with each other as it is. Or do you want to add N*(N-1) cross-encodings for every combination of 2? What about 3 in one string?

>> Now also take into account math symbols, currency symbols and beyond.
>> Also these days cultures are mixing in wild combinations so you might
>> need to see the text even if you can't read it. Unicode is not only
>> "encode characters from all languages". It needs to address universal
>> representation of symbolics used in writing systems at large.
> I take your point that it isn't just languages, but symbols also.  I see
> no reason why UTF-8 is a better encoding for that purpose than the kind
> of simple encoding I've suggested.
>
>> We want monoculture! That is to understand each without all these
>> "par-le-vu-france?" and codepages of various complexity(insanity).
> I hate monoculture, but then I haven't had to decipher some screwed-up
> codepage in the middle of the night. ;)

So you never had trouble with internationalization? What languages do you use (read/speak/etc.)?

>That said, you could standardize
> on UCS for your code space without using a bad encoding like UTF-8, as I
> said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.

>> Want small - use compression schemes which are perfectly fine and get
>> to the precious 1byte per codepoint with exceptional speed.
>> http://www.unicode.org/reports/tr6/
> Correct me if I'm wrong, but it seems like that compression scheme
> simply adds a header and then uses a single-byte encoding, exactly what
> I suggested! :)

This is it, but it's far more flexible in the sense that it allows multilingual strings just fine, and lone full-width Unicode codepoints as well.

> But I get the impression that it's only for sending over
> the wire, ie transmision, so all the processing issues that UTF-8
> introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal; their acceptance rate is one of the chief advantages they have. Unicode has horrifically large momentum now, and not a single organization aside from them tries to do this dirty work (=i18n).

>> And borrowing the arguments from from that rant: locale is borked shit
>> when it comes to encodings. Locales should be used for tweaking visual
>> like numbers, date display an so on.
> Is that worse than every API simply assuming UTF-8, as he says? Broken
> locale support in the past, as you and others complain about, doesn't
> invalidate the concept.

It's a combinatorial blowup and has some stone walls to run into. Consider adding another encoding for "Tuva", for instance. Now you have to add 2*n conversion routines to match it to other codepages/locales.

Beyond that - there are many things to consider in internationalization and you would have to special case them all by codepage.

> If they're screwing up something so simple,
> imagine how much worse everyone is screwing up something complex like
> UTF-8?

UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and stays compatible with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?
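For the record, the entire mapping fits in a few lines - a rough encoder sketch (no checks for surrogates or invalid values, just the octet layout; the function name is made up):

// Lead byte patterns: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx;
// every continuation byte is 10xxxxxx and carries 6 payload bits.
ubyte[4] encodeUtf8(dchar c, out size_t len)
{
    ubyte[4] buf;
    if (c < 0x80)
    {
        buf[0] = cast(ubyte) c;                            len = 1;
    }
    else if (c < 0x800)
    {
        buf[0] = cast(ubyte)(0xC0 | (c >> 6));
        buf[1] = cast(ubyte)(0x80 | (c & 0x3F));           len = 2;
    }
    else if (c < 0x10000)
    {
        buf[0] = cast(ubyte)(0xE0 | (c >> 12));
        buf[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = cast(ubyte)(0x80 | (c & 0x3F));           len = 3;
    }
    else
    {
        buf[0] = cast(ubyte)(0xF0 | (c >> 18));
        buf[1] = cast(ubyte)(0x80 | ((c >> 12) & 0x3F));
        buf[2] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));
        buf[3] = cast(ubyte)(0x80 | (c & 0x3F));           len = 4;
    }
    return buf;
}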

-- 
Dmitry Olshansky
May 25, 2013
25-May-2013 13:05, Joakim wrote:
> On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:
>> I think you stand alone in your desire to return to code pages.
> Nobody is talking about going back to code pages.  I'm talking about
> going to single-byte encodings, which do not imply the problems that you
> had with code pages way back when.

The problem is that what you outline is isomorphic to code pages. Hence the grief of accumulated experience against them.
>> Code pages simply are no longer practical nor acceptable for a global
>> community. D is never going to convert to a code page system, and even
>> if it did, there's no way D will ever convince the world to abandon
>> Unicode, and so D would be as useless as EBCDIC.
> I'm afraid you and others here seem to mentally translate "single-byte
> encodings" to "code pages" in your head, then recoil in horror as you
> remember all your problems with broken implementations of code pages,
> even though those problems are not intrinsic to single-byte encodings.
>
> I'm not asking you to consider this for D.  I just wanted to discuss why
> UTF-8 is used at all.  I had hoped for some technical evaluations of its
> merits, but I seem to simply be dredging up a bunch of repressed
> memories about code pages instead. ;)

Well, if somebody got a quest to redefine UTF-8 they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life-saver anyway.
>
> The world may not "abandon Unicode," but it will abandon UTF-8, because
> it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
> proliferate until someone comes up with something better to show how
> dumb they are.

Even children know XML is awful, redundant shit as an interchange format. The hierarchical document is a nice idea anyway.

> Perhaps it won't be the D programming language that does
> that, but it would be easy to implement my idea in D, so maybe it will
> be a D-based library someday. :)

Implement the Unicode compression scheme - at least that is standardized.



-- 
Dmitry Olshansky
May 25, 2013
25-May-2013 12:58, Vladimir Panteleev wrote:
> On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
>>> This is more a problem with the algorithms taking the easy way than a
>>> problem with UTF-8. You can do all the string algorithms, including
>>> regex, by working with the UTF-8 directly rather than converting to
>>> UTF-32. Then the algorithms work at full speed.
>> I call BS on this.  There's no way working on a variable-width
>> encoding can be as "full speed" as a constant-width encoding. Perhaps
>> you mean that the slowdown is minimal, but I doubt that also.
>
> For the record, I noticed that programmers (myself included) that had an
> incomplete understanding of Unicode / UTF exaggerate this point, and
> sometimes needlessly assume that their code needs to operate on
> individual characters (code points), when it is in fact not so - and
> that code will work just fine as if it was written to handle ASCII. The
> example Walter quoted (regex - assuming you don't want Unicode ranges or
> case-insensitivity) is one such case.

+1
BTW, regex even with Unicode ranges and case-insensitivity is doable, just not easy (yet).

> Another thing I noticed: sometimes when you think you really need to
> operate on individual characters (and that your code will not be correct
> unless you do that), the assumption will be incorrect due to the
> existence of combining characters in Unicode. Two of the often-quoted
> use cases of working on individual code points is calculating the string
> width (assuming a fixed-width font), and slicing the string - both of
> these will break with combining characters if those are not accounted
> for.  I believe the proper way to approach such tasks is to implement the
> respective Unicode algorithms for it, which I believe are non-trivial
> and for which the relative impact for the overhead of working with a
> variable-width encoding is acceptable.

Another plus one. Algorithms defined on a code-point basis are quite complex, so the benefit of not decoding won't be that large. The benefit of transparently special-casing ASCII in UTF-8 is far larger.

> Can you post some specific cases where the benefits of a constant-width
> encoding are obvious and, in your opinion, make constant-width encodings
> more useful than all the benefits of UTF-8?
>
> Also, I don't think this has been posted in this thread. Not sure if it
> answers your points, though:
>
> http://www.utf8everywhere.org/
>
> And here's a simple and correct UTF-8 decoder:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/


-- 
Dmitry Olshansky
May 25, 2013
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>> I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).
> Incorrect.
>
> "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."
> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>

That confirms exactly what I just said...

>> You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem.
> And what's a dchar?  Let's check:
>
> dchar : unsigned 32 bit UTF-32
> http://dlang.org/type.html
>
> Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32 bits already, it doesn't make the slightest difference. The only additional operations on top of ASCII are for multi-byte characters, and even then it's some simple bit manipulation, which is as fast as any variable-width encoding is going to get.
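For concreteness, the bit manipulation in question is roughly this (a sketch that assumes well-formed input and skips all validation; the function name is made up):

// Decode one code point starting at s[i]; advances i past it.
dchar decodeOne(const(char)[] s, ref size_t i)
{
    uint b = s[i++];
    if (b < 0x80)
        return cast(dchar) b;                      // plain ASCII, one byte
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    uint cp = b & (0x3F >> extra);                 // payload bits of the lead byte
    foreach (_; 0 .. extra)
        cp = (cp << 6) | (s[i++] & 0x3F);          // 6 payload bits per continuation byte
    return cast(dchar) cp;
}

Just a few shifts and masks per multi-byte character.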

The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate strings of different code pages.

- Multiple code pages per string
This just makes everything overly complicated, and decoding what the actual character is becomes far slower than with UTF-8.

- String with escape sequences to change code page
You can no longer access characters in the middle or at the end of the string; you have to parse the entire string every time, which completely negates the benefit of a fixed-width encoding.

- An encoding wide enough to store every character
This is just UTF-32.

>
>> Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters).
> The vast majority of non-english alphabets in UCS can be encoded in a single byte.  It is your exceptions that are not relevant.

Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!"

I.e. it's possible to devise an encoding that will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...

- A useful encoding has to be able to handle every Unicode character
- As I've shown, the only space-efficient way to do this is a variable-length encoding like UTF-8
- Given the frequency distribution of Unicode characters, UTF-8 does a pretty good job of encoding higher-frequency characters in fewer bytes.
- Yes, you COULD encode non-English alphabets in a single byte, but doing so would be inefficient because it would mean the more frequently used characters take more bytes to encode.
May 25, 2013
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
> 25-May-2013 10:44, Joakim wrote:
>> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
>> on the code space.  I was originally going to title my post, "Why
>> Unicode?" but I have no real problem with UCS, which merely standardized
>> a bunch of pre-existing code pages.  Perhaps there are a lot of problems
>> with UCS also, I just haven't delved into it enough to know.
>
> UCS is dead and gone. Next in line to "640K is enough for everyone".
I think you are confused.  UCS refers to the Universal Character Set, which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to.

>>> Separate code spaces were the case before Unicode (and utf-8). The
>>> problem is not only that without header text is meaningless (no easy
>>> slicing) but the fact that encoding of data after header strongly
>>> depends a variety of factors -  a list of encodings actually. Now
>>> everybody has to keep a (code) page per language to at least know if
>>> it's 2 bytes per char or 1 byte per char or whatever. And you still
>>> work on a basis that there is no combining marks and regional specific
>>> stuff :)
>> Everybody is still keeping code pages, UTF-8 hasn't changed that.
>
> Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages.  I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.

>> Does
>> UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
>> char or whatever?"
>
> It's coherent in its scheme to determine that. You don't need extra information synced to the text, unlike header stuff.
?!  It's okay because you deem it "coherent in its scheme"?  I deem headers much more coherent. :)

>> It has to do that also. Everyone keeps talking about
>> "easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
>> turns UTF-8 into UTF-32 internally for all that ease of use, at least
>> doubling your string size in the process.  Correct me if I'm wrong, that
>> was what I read on the newsgroup sometime back.
>
> Indeed you are - searching for a UTF-8 substring in a UTF-8 string doesn't do any decoding, and it does return you a slice of the original (the balance starting at the match).
Perhaps substring search doesn't strictly require decoding but you have changed the subject: slicing does require decoding and that's the use case you brought up to begin with.  I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.

> ??? That simply makes no sense. Some legacy encodings have no intersection with each other as it is. Or do you want to add N*(N-1) cross-encodings for every combination of 2? What about 3 in one string?
I sketched two possible encodings above, none of which would require "cross-encodings."

>>> We want monoculture! That is to understand each without all these
>>> "par-le-vu-france?" and codepages of various complexity(insanity).
>> I hate monoculture, but then I haven't had to decipher some screwed-up
>> codepage in the middle of the night. ;)
>
> So you never had trouble with internationalization? What languages do you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past.  I can read and speak multiple languages, but I don't use anything other than English text.

>>That said, you could standardize
>> on UCS for your code space without using a bad encoding like UTF-8, as I
>> said above.
>
> UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above.  If that's a myth, Unicode is a myth. :)

> This is it, but it's far more flexible in the sense that it allows multilingual strings just fine, and lone full-width Unicode codepoints as well.
That's only because it uses a more complex header than a single byte for the language - something I noted could be done with my scheme as well, long before you mentioned this Unicode compression scheme.

>> But I get the impression that it's only for sending over
>> the wire, ie transmision, so all the processing issues that UTF-8
>> introduces would still be there.
>
> Use mime-type etc. Standards are always a bit stringy and suboptimal; their acceptance rate is one of the chief advantages they have. Unicode has horrifically large momentum now, and not a single organization aside from them tries to do this dirty work (=i18n).
You misunderstand.  I was saying that this Unicode compression scheme doesn't help you with string processing; it is only for transmission and is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme!  You do raise a good point: the only reason we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.

> Consider adding another encoding for "Tuva", for instance. Now you have to add 2*n conversion routines to match it to other codepages/locales.
Not sure what you're referring to here.

> Beyond that - there are many things to consider in internationalization and you would have to special case them all by codepage.
Not necessarily.  But that is actually one of the advantages of single-byte encodings, as I have noted above.  toUpper is a NOP for a single-byte-encoded string in an Asian script; you can't do that with a UTF-8 string.

>> If they're screwing up something so simple,
>> imagine how much worse everyone is screwing up something complex like
>> UTF-8?
>
> UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and stays compatible with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.

There are two parts to Unicode.  I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense.  I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
> 25-May-2013 13:05, Joakim wrote:
>> Nobody is talking about going back to code pages.  I'm talking about
>> going to single-byte encodings, which do not imply the problems that you
>> had with code pages way back when.
>
> The problem is that what you outline is isomorphic to code pages. Hence the grief of accumulated experience against them.
They may seem superficially similar but they're not.  For example, from the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution.  I don't think code pages provided that.

> Well, if somebody got a quest to redefine UTF-8 they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life-saver anyway.
Perhaps not, but I suspect programmers will flock to a constant-width encoding that is much simpler and more efficient than UTF-8.  Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.

>> The world may not "abandon Unicode," but it will abandon UTF-8, because
>> it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
>> proliferate until someone comes up with something better to show how
>> dumb they are.
>
> Even children know XML is awful, redundant shit as an interchange format. The hierarchical document is a nice idea anyway.
_We_ both know that, but many others don't, or XML wouldn't be as popular as it is. ;) I'm making a similar point about the more limited success of UTF-8, i.e. it's still shit.