May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Diggory

On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
> "limited success of UTF-8"
>
> Becoming the de-facto standard encoding EVERYWHERE except for Windows, which uses UTF-16, is hardly a failure...

So you admit that UTF-8 hasn't been used on the vast majority of computers since the inception of Unicode. That's what I call limited success, thank you for agreeing with me. :)

> I really don't understand your hatred for UTF-8 - it's simple to decode and encode, fast and space-efficient. Fixed width encodings are not inherently fast, the only thing they are faster at is if you want to randomly access the Nth character instead of the Nth byte. In the rare cases that you need to do a lot of this kind of random access there exists UTF-32...

Space-efficient? Do you even understand what a single-byte encoding is? Suffice to say, a single-byte encoding beats UTF-8 on all these measures, not just one.

> Any fixed width encoding which can encode every Unicode character must use at least 3 bytes, and using 4 bytes is probably going to be faster because of alignment, so I don't see what the great improvement over UTF-32 is going to be.

Slaps head. You don't need "at least 3 bytes" because you're packing the language info into the header. I don't think you even know what I'm talking about.

>> slicing does require decoding
>
> Nope.

Of course it does, at least partially. There is no other way to know where the code points are.

>> I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.
>
> Unicode doesn't have "language character sets". The different planes only exist for organisational purposes; they don't affect how characters are encoded.

Nobody's talking about different planes. I'm talking about all the different language character sets in this list: http://en.wikipedia.org/wiki/List_of_Unicode_characters

>> ?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)
>
> Sure, if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, i.e. everything that you need to decode a character in the same place, not spread out between part of a character and a header.

Coherent means that the organizational pieces fit together and make sense conceptually, not that everything is stored together. My point is that putting the language info in a header seems much more coherent to me than ramming that info into every character.

>> but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
>
> The only time you need to decode is when you need to do some transformation that depends on the code point, such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc. - basically all the most common text operations - can all be done without any encoding or decoding.

Slicing by byte, which is the only way to slice without decoding, is useless; I have to laugh that you even include it. :) All these basic operations can be done very fast, often faster than UTF-8, in a single-byte encoding. Once you start talking code points, it's no contest: UTF-8 flat out loses.

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
>> All a code page is is a table of mappings; UCS is just a much larger, standardized table of such mappings.
> UCS has nothing to do with code pages, it was designed as a replacement for them. A code page is a strict subset of the possible characters; UCS is the entire set of possible characters.

"[I]t was designed as a replacement for them" by combining several of them into a master code page and removing redundancies. Functionally, they are the same, and historically they maintain the same layout in at least some cases. To then say UCS has "nothing to do with code pages" is just dense.

>> I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get" but my claim is that single-byte encodings will be faster.
>
> I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 strings to UTF-32 strings before it uses them.
>
> i.e. the difference between this:
>
> string mystr = ...;
> dstring temp = mystr.to!dstring;
> for (int i = 0; i < temp.length; ++i)
>     process(temp[i]);
>
> and this:
>
> string mystr = ...;
> size_t i = 0;
> while (i < mystr.length) {
>     dchar current = decode(mystr, i);
>     process(current);
> }
>
> And if you can't see why the latter example is far more efficient I give up...

I take your point that phobos is often decoding char by char as it iterates through, but there are still functions in std.string that convert the entire string, as in your first example. The point is that you are forced to decode everything to UTF-32, whether char by char or the entire string at once. Your latter example may be marginally more efficient, but it is only useful for functions that start from the beginning and walk the string in only one direction, which not all operations do.

>>> - Multiple code pages per string
>>> This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8.
>> I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.
>
> The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8, which takes a few assembly instructions, has perfect locality, and can be efficiently pipelined by the CPU.

Lol, you think a few potential cache misses are going to be slower than repeatedly decoding, whether in assembly and pipelined or not, every single UTF-8 character? :D

> Then there's all the extra processing involved combining the headers when you concatenate strings. Plus you lose the one benefit a fixed width encoding has, because random access is no longer possible without first finding out which header controls the location you want to access.

There would be a few arithmetic operations on substring indices when concatenating strings, hardly anything. Random access is still not only possible, it is incredibly fast in most cases: you just have to check first whether the header lists any two-byte encodings. This can be done once and cached as a property of the string (set a boolean no_two_byte_encoding once and simply have the slice operator check it before going ahead), just as you could add a property to UTF-8 strings to allow quick random access if they happen to be pure ASCII. The difference is that only strings that include the two-byte encoded Korean/Chinese/Japanese characters would require a bit more calculation for slicing in my scheme, whereas _every_ non-ASCII UTF-8 string requires full decoding to allow random access.
This is a clear win for my single-byte encoding, though maybe not the complete demolition of UTF-8 you were hoping for. ;)

>> No, it's not the same at all. The contents of an arbitrary-length file cannot be compressed to a single byte, you would have collisions galore. But since most non-English alphabets have fewer than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining which language's code page to use. I don't understand your analogy whatsoever.
>
> It's very simple - the more information you have about the type of data you are compressing at the time of writing the algorithm, the better compression ratio you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why you have specialised compression algorithms for images, video, audio, etc.

This may be mostly true in general, but your specific example of compressing down to a byte is nonsense. For any arbitrarily long data, there are always limits to compression. What any of this has to do with my single-byte encoding, I have no idea.

> It doesn't matter how few characters non-English alphabets have - unless you know WHICH alphabet it is beforehand you can't store it in a single byte. Since any given character could be in any alphabet, the best you can do is look at the probabilities of different characters appearing and use shorter representations for more common ones. (This is the basis for all lossless compression.) The English alphabet plus 0-9 and basic punctuation are by far the most common characters used on computers, so it makes sense to use one byte for those and multiple bytes for rarer characters.

How many times have I said that "you know WHICH alphabet it is beforehand" because that info is stored in the header? That is why I specifically said, from my first post, that multi-language strings would have more complex headers, which I later pointed out could list all the different language substrings within a multi-language string. Your silly exposition of how compression works makes me wonder if you understand anything about how a single-byte encoding would work. Perhaps it made sense to use one byte for ASCII characters and relegate _every other language_ to multiple bytes two decades ago. It doesn't make sense today.

>>> - As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8
>> You haven't shown this.
> If you had thought through your suggestion of multiple code pages per string you would see that I had.

You are not packaging and transmitting the code pages with the string, just as you do not ship the entire UCS with every UTF-8 string. A single-byte encoding is going to be more space-efficient for the vast majority of strings, everybody knows this.

>> No, it does a very bad job of this. Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with fewer than 256 characters in a single byte.
>
> And strings with mixed characters would use lots of memory and be extremely slow. Common when using proper names, quotes, inline translations, graphical characters, etc. etc. Not to mention the added complexity to actually implement the algorithms.

Ah, you have finally stumbled across the path to a good argument, though I'm not sure how, given your seeming ignorance of how single-byte encodings work. :)
There _is_ a degenerate case with my particular single-byte encoding (not the ones you list, which would still be faster and use less memory than UTF-8): strings that use many, if not all, character sets. So the worst-case scenario might be something like a string that had 100 characters, every one from a different language. In that case, I think it would still be smaller than the equivalent UTF-8 string, but not by much. There might be some complexity in implementing the algorithms, but on net, likely less than UTF-8, while being much more usable for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
> 1) Take the byte at a particular offset in the string
> 2) If it is ASCII then we're done
> 3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this)
> 4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc.
> 5) All following bytes are of the form '10xxxxxx'
> 6) Now just concatenate all the 'x's together and add an offset to get the code point

Not sure why you chose to write this basic UTF-8 stuff out, other than to bluster on without much use.

> Note that this is CONSTANT TIME, O(1) with minimal branching, so well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU), and only sequential memory access so no cache misses, and zero additional memory requirements.

It is constant time _per character_. You have to do it for _every_ non-ASCII character in your string, so the decoding adds up.

> Now compare your encoding:
> 1) Look up the offset in the header using binary search: O(log N), lots of branching

It is difficult to reason about the header, because it all depends on the number of languages used and how many substrings there are. There are worst-case scenarios that could approach something like log(n), but they are extremely unlikely in real-world use. Most of the time, this would be O(1).

> 2) Look up the code page ID in a massive array of code pages to work out how many bytes per character

Hardly, this could be done by a simple lookup function that checked whether the language was one of the few alphabets that require two bytes.

> 3) Hope this array hasn't been paged out and is still in the cache
> 4) Extract that many bytes from the string and combine them into a number

Lol, I love how you think this is worth listing as a separate step for the few two-byte encodings, yet have no problem with doing this for every non-ASCII character in UTF-8.

> 5) Look up this new number in yet another large array specific to the code page

Why? The language byte and number uniquely specify the character, just like your Unicode code point above. If you were simply encoding the UCS in a single-byte encoding, you would arrange your scheme in such a way that you could trivially generate the UCS code point from these two bytes.

> This is O(log N), has lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory requirements to store all those tables, and an algorithm that isn't even any easier to understand.

Wrong on practically every count, as detailed above.

> Plus every other algorithm to operate on it except for decoding is insanely complicated.

They are still much _less_ complicated than UTF-8, that's the comparison that matters.
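For reference, the decode steps Diggory lists above come out to only a handful of operations per character. A minimal sketch in D of what they look like in code, assuming already-validated UTF-8 input; the function name is made up for illustration, and Phobos's std.utf.decode is the real, checked implementation:
----------------------------------
import core.bitop : bsr;

// Decode one code point starting at index i, advancing i past it.
// Assumes s[i .. $] begins a valid UTF-8 sequence.
dchar decodeOne(const(char)[] s, ref size_t i)
{
    ubyte b = cast(ubyte) s[i++];
    if (b < 0x80)                      // step 2: plain ASCII, done
        return cast(dchar) b;
    int len = 7 - bsr(~b & 0xFF);      // step 3: count the leading 1 bits
    uint cp = b & (0x7F >> len);       // step 4: payload bits of the lead byte
    foreach (_; 1 .. len)              // step 5: 10xxxxxx continuation bytes
        cp = (cp << 6) | (s[i++] & 0x3F);
    return cast(dchar) cp;             // step 6: assemble the code point from the bits
}
----------------------------------
As both sides note, the cost is constant per character but paid for every non-ASCII character decoded.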
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Dmitry Olshansky

On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
> Runs away in horror :) It's a mess even before you've got to the details.

Perhaps it's fatally flawed, but I don't see an argument for why, so I'll assume you can't find such a flaw. It is still _much less_ messy than UTF-8, that is the critical distinction.

> Another point about using sometimes a 2-byte encoding - welcome to the nice world of BigEndian/LittleEndian, i.e. the very trap UTF-16 has stepped into.

I don't think this is a sizable obstacle. It takes some coordination, but it is a minor issue.

On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
> You obviously are not thinking it through. Such an encoding would have O(n^2) complexity for appending a character/symbol in a different language to the string, since you would have to update the beginning of the string and move the contents forward to make room. Not to mention that it wouldn't be backwards compatible with ASCII routines, and the complexity of such a header would have to be carried all the way to the font rendering routines in the OS.

You obviously have not read the rest of the thread; both your non-font-related assertions have been addressed earlier. I see no reason why a single-byte encoding of UCS would have to be carried to "font rendering routines" but UTF-8 wouldn't be.

> Multiple languages/symbols in one string is a blessing of modern humane computing. It is the norm more than the exception in most of the world.

I disagree, but in any case, most of this thread refers to multi-language strings. The argument is about how best to encode them.

On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
> On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
>>> I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string.
>> Not being aware of this shortcut doesn't mean not understanding UTF-8.
>
> It's not just a shortcut, it is absolutely fundamental to the design of UTF-8. It's like saying you understand Lisp without being aware that everything is a list.

It is an accidental shortcut because of the encoding scheme chosen for UTF-8 and, as I've noted, still less efficient than similarly searching a single-byte encoding. The fact that you keep trumpeting this silly detail as somehow "fundamental" suggests you have no idea what you're talking about.

> Also, you continuously keep stating disadvantages to UTF-8 that are completely false, like "slicing does require decoding". Again, completely missing the point of UTF-8. I cannot conceive how you can claim to understand how UTF-8 works yet repeatedly demonstrate that you do not.

Slicing on code points requires decoding, I'm not sure how you don't know that. If you mean slicing by byte, that is not only useless, but _every_ encoding can do that. I cannot conceive how you claim to defend UTF-8, yet keep making such stupid points that you don't even bother backing up.

> You are either ignorant or a successful troll. In either case, I'm done here.

Must be nice to just insult someone who has demolished your arguments and leave. Good riddance, you weren't adding anything.
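To make the disputed "slicing on code points requires decoding" point concrete: in D, taking the first n code points of a UTF-8 string means stepping over each one, because the byte offset of the nth code point is not known in advance. A small sketch; the function name is invented here, and std.utf.stride is the Phobos primitive it relies on:
----------------------------------
import std.utf : stride;

// Return the slice holding the first n code points of a UTF-8 string.
// Finding the end offset requires walking the string, which is the
// per-character cost being argued about; the byte slice itself is O(1).
string firstCodePoints(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
    {
        if (i >= s.length)
            break;
        i += stride(s, i);   // width in bytes of the code point starting at i
    }
    return s[0 .. i];
}
----------------------------------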
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to H. S. Teoh

For some reason this posting by H. S. Teoh shows up on the mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
>> The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
>
> I'll have you know that Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage.

Not just "a majority," the vast majority of alphabets, representing 85% of the world's population.

>>> The only alternatives to a variable width encoding I can see are:
>>> - Single code page per string
>>> This is completely useless because now you can't concatenate strings of different code pages.
>> I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
>
> This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world is at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter in order to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages. And yes, you can say "well, just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is non-encodable anymore. By the time you pull in all the French, German, Greek letters and math symbols, you might as well just go back to UTF-8.

I think you misunderstand what this implies. I mentioned it earlier as another possibility to Walter, "keep all your strings in a single language, with a different format to compose them together." Nobody is talking about disallowing alphabets other than English or going back to code pages. The fundamental question is whether it makes sense to combine all these different alphabets and their idiosyncratic rules into a single string and encoding. There is a good argument to be made that the differences outweigh the similarities and you'd be better off keeping each language/alphabet in its own string. It's a question of modeling, just like a class hierarchy. As I said, I'm not sure this is the best route, but it has some real strengths.

> The alternative is to have embedded escape sequences for the rare foreign letter/word that you might need, but then you're back to being unable to slice the string at will, since slicing it at the wrong place will produce gibberish.

No one has presented this as a viable option.

> I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are things about it that are annoying, but it's certainly better than the scheme you're proposing.

I disagree.
On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
> And just how exactly does that help with slicing? If anything, it makes slicing way hairier and more error-prone than UTF-8. In fact, this one point alone already defeated any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8.

There are no convoluted algorithms, it's a simple check whether the string contains any two-byte encodings, a check which can be done once and cached. If it's single-byte all the way through, no problems whatsoever with slicing. If there are two-byte languages included, the slice function will have to do a little arithmetic calculation before slicing. You will also need a few arithmetic ops to create the new header for the slice. The point is that these operations will be much faster than decoding every code point to slice UTF-8.

> Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse.

Well, I'm glad you realize some problems with UTF-8, :) even if you dismiss my alternative out of hand.
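Since no code for the proposed scheme appears anywhere in the thread, the closest thing to a concrete picture is a guess at what the description implies. The sketch below is purely hypothetical: every name in it is invented here, and it encodes just one possible reading of "a list of languages and a list of pure single-language substrings" plus the cached no-two-byte flag mentioned above:
----------------------------------
// Hypothetical layout only; nothing like this exists in Phobos or elsewhere.
struct LangRun
{
    ubyte langId;   // single byte identifying the run's code page
    uint  start;    // byte offset where the run begins
    uint  end;      // byte offset one past the run's last byte
}

struct HeaderString
{
    LangRun[] runs;       // one entry per single-language substring
    ubyte[]   bytes;      // payload: 1 byte/char, except 2 bytes/char in CJK runs
    bool      hasTwoByte; // cached once when the string is built

    // Fast path for the byte payload: with no two-byte runs, the byte offsets
    // are the character offsets. A full slice would also have to trim `runs`
    // to build the new header, and the two-byte case needs the extra
    // arithmetic that H. S. Teoh is asking to see spelled out.
    ubyte[] sliceBytes(size_t lo, size_t hi)
    {
        assert(!hasTwoByte, "two-byte runs need the adjusted path");
        return bytes[lo .. hi];
    }
}
----------------------------------
Even in this charitable sketch, concatenation and slicing still have to rebuild the run list, the "few arithmetic ops" referred to above.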
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
>> I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.
>
> If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings containing a few Asian languages are variable-length, as opposed to UTF-8 making every non-English language string variable-length. It may be more work to write library code to handle my encoding, perhaps, but efficiency and ease of use are paramount.

>> So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages,
>
> Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a two-byte encoding for Chinese, and I already acknowledged that such a predominantly single-byte encoding is still variable-length. The problem is that _you_ try to have it both ways: first you claimed it is variable-length because I support Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several conditionals in phobos depending on whether a language supports uppercase or not. The question is whether the conditionals for a single-byte encoding will execute faster than decoding every UTF-8 character. This is a matter of engineering judgement; I see no reason why you think decoding every UTF-8 character is faster.

>> then you claim I don't handle these languages. This kind of blatant contradiction within two posts can only be called... trolling!
>
> You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned to use two bytes to encode Chinese? I'm not sure why you're continuing along this contradictory path.

I didn't "handwave" about multi-language strings, I gave specific ideas about how they might be implemented. I'm not claiming to have a bullet-proof and detailed single-byte encoding spec, just spitballing some ideas on how to do it better than the abominable UTF-8.

> Worse, there are going to be more than 256 of these encodings - you can't even have a byte to specify them. Remember, Unicode has approximately 256,000 characters in it. How many code pages is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, maybe another 50 symbolic sets. That leaves space for another 100 or so new scripts. Maybe you are so worried about future-proofing that you'd use two bytes to signify the alphabet, but I wouldn't. I think it's more likely that we'll ditch scripts than add them. ;) Most of those symbol sets should not be in UCS.
> I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding from 20 years ago that is really only useful when streaming text, which nobody does today. There may have been a time when ASCII compatibility was paramount, when nobody cared about internationalization and almost all libraries only took ASCII input: that is not the case today.

> I'll be the first to admit that a lot of great ideas have been initially dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on reddit, stackoverflow, hackernews, etc., and look for fertile ground for it. I'm sorry you're not finding fertile ground here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it.

Let me admit in return that I might be completely wrong about my single-byte encoding representing a step forward from UTF-8. While this discussion has produced no argument that I'm wrong, it's possible we've all missed something salient, some deal-breaker. As I said before, I'm not proposing that D "switch over." I was simply asking people who know, or at the very least use, UTF-8 more than most, as a result of employing one of the few languages with Unicode support baked in, why they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits before I went ahead and implemented this single-byte encoding. Since nobody has been able to point out a reason why my encoding wouldn't be much better than UTF-8, I see no reason not to go forward with my implementation. I may write something up after implementation: most people don't care about ideas, only results, to the point where almost nobody can reason at all about ideas.

> Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.

I don't think my claims are extraordinary or backed by "handwaving and assumptions." Some people can reason about such possible encodings, even in the incomplete form I've sketched out, without having implemented them, if they know what they're doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
> On 5/25/2013 2:51 PM, Walter Bright wrote:
>> On 5/25/2013 12:51 PM, Joakim wrote:
>>> For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.
>>
>> Please implement the simple C function strstr() with this simple scheme, and post it here.
>>
>> http://www.digitalmars.com/rtl/string.html#strstr
>
> I'll go first. Here's a simple UTF-8 version in C.
> It's not the fastest way to do it, but at least it is correct:
> ----------------------------------
> char *strstr(const char *s1,const char *s2) {
>     size_t len1 = strlen(s1);
>     size_t len2 = strlen(s2);
>     if (!len2)
>         return (char *) s1;
>     char c2 = *s2;
>     while (len2 <= len1) {
>         if (c2 == *s1)
>             if (memcmp(s2,s1,len2) == 0)
>                 return (char *) s1;
>         s1++;
>         len1--;
>     }
>     return NULL;
> }

There is no question that a UTF-8 implementation of strstr can be simpler to write in C and D for multi-language strings that include Korean/Chinese/Japanese. But while the strstr implementation for my encoding would contain more conditionals and lines of code, it would be far more efficient. For instance, because you know where all the language substrings are from the header, you can potentially rule out searching vast swathes of the string, because they don't contain the same languages or lengths as the string you're searching for.

Even if you're searching a single-language string, which won't have those speedups, your naive implementation checks every byte in UTF-8, even continuation bytes, to see if it might match the first letter of the search string, even though no continuation byte will match. You can avoid this by partially decoding the leading bytes of UTF-8 characters and skipping over continuation bytes, as I've mentioned earlier in this thread, but you've then added more lines of code to your pretty yet simple function and added decoding overhead to every iteration of the while loop.

My single-byte encoding has none of these problems; in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8.

Finally, being able to write simple yet inefficient functions like this is not the test of a good encoding, as strstr is a library function, and making library developers' lives easier is a low priority for any good format. The primary goals are ease of use for library consumers, i.e. app developers, and speed and efficiency of the code. You are trading away the latter two for the former with this implementation. That is not a good tradeoff. Perhaps it was a good trade 20 years ago, when everyone rolled their own code and nobody bothered waiting for those floppy disks to arrive with expensive library code. It is not a good trade today.
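The "skip over continuation bytes" refinement mentioned above is a small change to the search loop: a byte of the form 10xxxxxx can never start a UTF-8 sequence, so it can never begin a match. A sketch in D, assuming both strings are valid UTF-8; the function name is made up, and Phobos and the C runtime of course have their own search routines:
----------------------------------
// Substring search that only tests bytes which can start a code point.
// Continuation bytes (10xxxxxx) are skipped without any decoding.
inout(char)[] findSub(inout(char)[] haystack, const(char)[] needle)
{
    if (needle.length == 0)
        return haystack;
    foreach (i; 0 .. haystack.length)
    {
        if ((haystack[i] & 0xC0) == 0x80)   // continuation byte, cannot start a match
            continue;
        if (haystack.length - i >= needle.length
                && haystack[i .. i + needle.length] == needle)
            return haystack[i .. $];
    }
    return null;
}
----------------------------------
The mask-and-compare is paid on every byte, so whether it beats the plain byte-by-byte loop depends on how much of the text is non-ASCII; that trade-off is exactly what the two sides here disagree about.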
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
> [...]
I服了u ("I give up, you win") - I'm thinking your name means "joking"?
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
> [...]
I suggest you make an attempt at writing strstr and post it. Code speaks louder than words.
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On 5/26/2013 4:31 AM, Joakim wrote:
> My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for your scheme!

http://www.youtube.com/watch?v=dhRUe-gz690
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On Sunday, 26 May 2013 at 12:55:11 UTC, Walter Bright wrote:
> On 5/26/2013 4:31 AM, Joakim wrote:
>> My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8.
>
> C'mon, Joakim, show us this amazing strstr() implementation for your scheme!

You will see it when it's built into a fully working single-byte encoding implementation. I don't write toy code, particularly inefficient functions like yours, for the reasons given, which seem to have gone over your head.

> http://www.youtube.com/watch?v=dhRUe-gz690

Heh, never seen that sketch before. Never understood why anyone likes this silly Monty Python stuff, from what little I've seen.
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
>> And just how exactly does that help with slicing? If anything, it makes slicing way hairier and more error-prone than UTF-8. In fact, this one point alone already defeated any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8.
>
> There are no convoluted algorithms, it's a simple check whether the string contains any two-byte encodings, a check which can be done once and cached.

IHBT. You said that to handle multilanguage strings, your header would have a list of starting/ending points indicating which encoding should be used for which substring(s). That has nothing to do with two-byte encodings. So, please show us the code: given a string containing, say, English and French substrings, what will the header look like? And what's the algorithm to take a slice of such a string?

> If it's single-byte all the way through, no problems whatsoever with slicing.

Huh?! How are there no problems with slicing? Let's say you have a string that contains both English and French. According to your scheme, you'll have some kind of header format that lets you say bytes 0-123 are English, bytes 124-129 are French, and bytes 130-200 are English. Now let's say I want a substring from 120 to 125. How would this be done? And what about if I want a substring from 120 to 140? Or 126 to 130? What if the string contains several runs of French? Please show us the code.

> If there are two-byte languages included, the slice function will have to do a little arithmetic calculation before slicing. You will also need a few arithmetic ops to create the new header for the slice. The point is that these operations will be much faster than decoding every code point to slice UTF-8.

You haven't proven that this "little arithmetic calculation" will be faster than manipulating UTF-8. What if I have an English text that contains quotations of Chinese, French, and Greek snippets? Math symbols? Please show us (1) how such a string should be encoded under your scheme, and (2) the code that will slice such a string in an efficient way, according to your proposed encoding scheme.

(And before you dismiss such a string as unlikely or write it off as rare, consider a technical math paper that cites the work of Chinese and French authors -- a rather common thing these days. You'd need the extra characters just to be able to cite their names, even if none of the actual Chinese or French is quoted verbatim. Greek in general is used all over math anyway, since for whatever reason mathematicians just love Greek symbols, so it pretty much needs to be included by default.)

>> Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse.
> Well, I'm glad you realize some problems with UTF-8, :) even if you dismiss my alternative out of hand.

Clearly, we're not seeing what you're seeing here. So instead of making general statements about the superiority of your scheme, you might want to show us the actual code. So far, I haven't seen anything that convinces me that your scheme is any better.
In fact, from what I can see, it's a lot worse, and you're just evading pointed questions about how to address those problems. Maybe that's a wrong perception, but not having any actual code to look at, I'm having a hard time believing your claims. Right now I'm leaning towards agreeing with Walter that you're just trolling us (and rather successfully at that).

So, please show us the code. Otherwise, I think I should just stop responding, as we're obviously not on the same page and this discussion isn't getting anywhere.

T

--
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to H. S. Teoh

On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
> IHBT. You said that to handle multilanguage strings, your

Pretty funny how you claim you've been trolled and then go on to make a bunch of trolling arguments, which seem to imply you have no idea how a single-byte encoding works. I'm not going to bother explaining it to you; anyone who knows encodings can easily figure it out from what I've said so far.

> Clearly, we're not seeing what you're seeing here. So instead of making general statements about the superiority of your scheme, you might want to show us the actual code. So far, I haven't seen anything that convinces me that your scheme is any better. In fact, from what I can see, it's a lot worse, and you're just evading pointed questions about how to address those problems. Maybe that's a wrong perception, but not having any actual code to look at, I'm having a hard time believing your claims. Right now I'm leaning towards agreeing with Walter that you're just trolling us (and rather successfully at that).

When someone makes arguments that fly over your head, that's not trolling, that's you not understanding what they're saying. I have demolished every claim that has been made about single-byte encoding being worse. If you can't understand my arguments, you need to go out and learn some more about these issues.

> So, please show us the code. Otherwise, I think I should just stop responding, as we're obviously not on the same page and this discussion isn't getting anywhere.

I've made my position clear: I don't write toy code. It would take too long for the kind of encoding I have in mind, so it isn't worth my time, and if you can't understand the higher-level technical language I'm using in these posts, you won't understand the code anyway. I have adequately sketched what I'd do, so that anyone proficient in the art can reason about what the consequences of such a scheme would be. Perhaps that doesn't include Walter and you. I don't know why you'd want to keep responding to someone you think is trolling you anyway.