[OT] Effect of UTF-8 on 2G connections

June 01, 2016

Re: [OT] Effect of UTF-8 on 2G connections

Posted by Joakim
in reply to Marco Leise

On Wednesday, 1 June 2016 at 14:58:47 UTC, Marco Leise wrote:
> Am Wed, 01 Jun 2016 13:57:27 +0000
> schrieb Joakim <dlang@joakim.fea.st>:
>
>> No, I explicitly said not the web in a subsequent post.  The ignorance here of what 2G speeds are like is mind-boggling.
>
> I've used 56k and had a phone conversation with my sister while she was downloading an 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more.

I see that max 2G speeds are 100-200 kbits/s.  At that rate, it would have taken her roughly 9 to 19 hours to download such a large file, which is nuts.  The worst part is when the download gets interrupted and you have to start over again because most download managers don't know how to resume, including the stock one on Android.
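
For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the quoted 100-200 kbit/s is the sustained rate and ignores protocol overhead, latency and interrupted transfers; the function name `download_hours` is made up for illustration.

```python
# Rough transfer-time estimate for a file over a slow link.
# Assumes a constant sustained rate and no protocol overhead or retries.
MIB = 1024 * 1024  # bytes in one mebibyte


def download_hours(size_mib, rate_kbit_s):
    """Hours needed to move size_mib mebibytes at rate_kbit_s kilobits/s."""
    bits = size_mib * MIB * 8
    seconds = bits / (rate_kbit_s * 1000)
    return seconds / 3600


for rate in (100, 200):
    print(f"{rate} kbit/s: {download_hours(800, rate):.1f} hours")
```

Under those assumptions an 800 MiB file takes a bit over 9 hours at the top of the range and a bit under 19 hours at the bottom, so real-world times would likely be longer still.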

Also, people in these countries buy packs of around 100-200 MB for 30-60 US cents, so they would never download such a large file.  They use messaging apps like WhatsApp or WeChat, which nobody in the US uses, to avoid onerous SMS charges.

> Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third

Yes, only the middle class, which are at most 10-30% of the population in these developing countries, can even afford 2G.  The way to get costs down even further is to make the tech as efficient as possible.  Of course, much of the rest of the population are illiterate, so there are bigger problems there.

> But even if you could prove with a study that UTF-8 caused a
> notable bandwidth cost in real life, it would - I think - be a
> matter of regional ISPs to provide special servers and apps
> that reduce data volume.

Yes, by ditching UTF-8.

> There is also the overhead of
> key exchange when establishing a secure connection:
> http://stackoverflow.com/a/20306907/4038614
> Something every app should do, but will increase bandwidth use.

That's not going to happen; even HTTP/2 ditched that requirement.  Also, many of those countries' govts will not allow it: google how Blackberry had to give up their keys for "secure" BBM in many countries.  It's not just Canada and the US spying on their citizens.

> Then there is the overhead of using XML in applications
> like WhatsApp, which I presume is quite popular around the
> world. I'm just trying to broaden the view a bit here.

I didn't know they used XML.  Googling it now, I see mention that they switched to an "internally developed protocol" at some point, so I doubt they're using XML now.

> This note from the XMPP that WhatsApp and Jabber use will make
> you cringe: https://tools.ietf.org/html/rfc6120#section-11.6

Haha, no wonder Jabber is dead. :) I jumped on Jabber for my own messages a decade ago, as it seemed like an open way out of that proprietary messaging mess, then I read that they're using XML and gave up on it.

On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
> On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
>>
>> No, I explicitly said not the web in a subsequent post.  The ignorance here of what 2G speeds are like is mind-boggling.
>>
> It's not hard.  I think a lot of us remember when a 14.4 modem was cutting-edge.

Well, then apparently you're unaware of how bloated web pages are nowadays.  It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.

> Codepages and incompatible encodings were terrible then, too.
>
> Never again.

This only shows you probably don't know the difference between an encoding and a code page, which are orthogonal concepts in Unicode.  It's not surprising, as Walter and many others responding show the same ignorance.  I explained this repeatedly in the previous thread, but it depends on understanding the tech, and I can't spoon-feed that to everyone.

>> Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
>
> It _is_ kind of ludicrous, isn't it?  But it really is the least-bad option for the most text.  Sorry, bub.

I think we can do a lot better.

>>> No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
>>
>> Lol, this may be the dumbest argument put forth yet.
>
> This just makes it feel like you're trolling.  You're not just trolling, right?

Are you trolling?  Because I was just calling it like it is.

The vast majority of software is written for _one_ language, the local one.  You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets.  But as a percentage of lines of code written, such international code is almost nothing.

>> I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.
>
> And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past.  This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.

No, I have never once suggested "turning back."  I have suggested a new scheme that retains one technical aspect of the prior schemes, i.e. constant-width encoding for each language, with a single byte sufficing for most.  _You and several others_, including Walter, see that and automatically translate that to, "He wants EBCDIC to come back!," as though that were the only possible single-byte encoding, largely ignoring the possibilities of the header scheme I suggested.

I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.

> If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.

I don't think you understand: _you_ are the special case.  The 5 billion people outside the US and EU are _not the special case_.  Yes, they have not mattered so far, because they were too poor to buy computers.  But the "computers" with the most sales these days are smartphones, and Motorola just launched their new Moto G4 in India and Samsung their new C5 and C7 in China.  They didn't bother announcing US release dates for these mid-range phones (well, they're high-end in those countries).  That's because "computer" sales in all these non-ASCII countries now greatly outweigh those in the US.

Now, a large majority of people in those countries don't have smartphones or text each other, so a significant chunk of the minority who do, buying mostly ~$100 smartphones over there, can likely afford a fatter text encoding; I don't know what encodings these developing markets are commonly using now.  The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet.  Ditching UTF-8 will be one way to make it more efficient.
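
To put rough numbers on the size argument, here is a small sketch that compares UTF-8 with an existing single-byte codec for a few scripts. To be clear, this is not the header scheme proposed in this thread, which isn't specified in enough detail to implement; the legacy codecs (`koi8_r`, `cp874`) merely stand in for "one byte per character", and the sample strings are arbitrary.

```python
# Compare encoded sizes: UTF-8 versus a legacy single-byte codec per script.
# The sample strings and codec choices are illustrative assumptions only.
SAMPLES = [
    ("English", "Hello, world", "ascii"),
    ("Russian", "Привет, мир", "koi8_r"),
    ("Thai", "สวัสดี", "cp874"),
]


def encoded_sizes(text, legacy_codec):
    """Return (UTF-8 byte count, legacy single-byte codec byte count)."""
    return len(text.encode("utf-8")), len(text.encode(legacy_codec))


for name, text, codec in SAMPLES:
    utf8_len, legacy_len = encoded_sizes(text, codec)
    print(f"{name}: {utf8_len} bytes in UTF-8, {legacy_len} bytes in {codec}")
```

This only measures the raw text payload. ASCII text comes out the same size either way, Cyrillic letters take two bytes each in UTF-8, and Thai characters take three.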

On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
> Indeed, Joakim's proposal is so insane it beggars belief (why not go back to baudot encoding, it's only 5 bit, hurray, it's so much faster when used with flag semaphores).

I suspect you don't understand my proposal.

> As a programmer in the European Commission translation unit, working on the probably biggest translation memory in the world for 14 years, I can attest that Unicode is a blessing. When I remember the shit we had in our documents because of the code pages before most programs could handle utf-8 or utf-16 (and before 2004 we only had 2 alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.

Oh, I'm well aware of this.  I just think a variable-length encoding like UTF-8 or UTF-16 is a bad design.  And what you have to realize is that most strings in most software will only have one language.  Anyway, the scheme I sketched out handles multiple languages: it just doesn't optimize for completely random jumbles of characters from every possible language, which is what UTF-8 is optimized for and is a ridiculous decision.

> Translators of course deal nearly exclusively with at least bi-lingual documents. Any document encountered by a translator must at least be able to present the source and the target language. But even outside of that specific population, multilingual documents are very, very common.

You are likely biased by the fact that all your documents are bilingual: they're _not_ common for the vast majority of users.  Even if they were, UTF-8 is as suboptimal, compared to the constant-width encoding scheme I've sketched, for bilingual or even trilingual documents as it is for a single language, so even if I were wrong about their frequency, it wouldn't matter.

On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
> If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.

UTF-8 encoded SMS work fine for me in GSM network, didn't notice any problem.

On 06/01/2016 12:26 PM, deadalnix wrote:
> On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
>> What Joakim does not understand, is that there are huge, huge
>> quantities of documents that are multi-lingual.
>
> That should be obvious to anyone living outside the USA.
>

Or anyone in the USA who's ever touched a product that includes a manual or a safety warning, or gone to high school (a foreign language class is pretty much universally mandatory, even in the US).

On Wednesday, 1 June 2016 at 16:26:36 UTC, deadalnix wrote:
> On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
>> What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.
>
> That should be obvious to anyone living outside the USA.

https://msdn.microsoft.com/th-th inside too :)

On 06/01/2016 12:41 PM, Nick Sabalausky wrote:
> As has been explained countless times already, code points are a non-1:1
> internal representation of graphemes. Code points don't exist for their
> own sake, their entire existence is purely as a way to encode graphemes.

Of course, thank you.

> Whether that technically qualifies as "memory representation" or not is
> irrelevant: it's still a low-level implementation detail of text.

The relevance is meandering across the discussion, and it's good to have the same definitions for terms. Unicode code points are abstract notions with meanings attached to them, whereas UTF8/16/32 are concerned with their representation.

Andrei
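
The three layers discussed in this exchange (graphemes, code points, and their UTF-8/16/32 representation) can be told apart with a short sketch. It uses Python's standard `unicodedata` module instead of D, and the single accented letter is just a convenient example.

```python
import unicodedata

# One user-perceived character ("é") written as two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Same grapheme, different code point sequences.
print(len(decomposed), len(composed))
print(decomposed == composed)

# The same code point (U+00E9) in three representations.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, len(composed.encode(codec)), "bytes")
```

The two strings render identically but compare unequal, which is the non-1:1 relationship between code points and graphemes; the byte counts differ only at the representation layer.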

On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:
> On 05/31/2016 02:46 PM, Timon Gehr wrote:
>> On 31.05.2016 20:30, Andrei Alexandrescu wrote:
>>> D's
>>
>> Phobos'
>
> foreach, too. -- Andrei

Incorrect. https://dpaste.dzfl.pl/ba7a65d59534

On 06/01/2016 01:35 PM, ZombineDev wrote:
> On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:
>> On 05/31/2016 02:46 PM, Timon Gehr wrote:
>>> On 31.05.2016 20:30, Andrei Alexandrescu wrote:
>>>> D's
>>>
>>> Phobos'
>>
>> foreach, too. -- Andrei
>
> Incorrect. https://dpaste.dzfl.pl/ba7a65d59534

Try typing the iteration variable with "dchar". -- Andrei

On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu wrote:
> Try typing the iteration variable with "dchar". -- Andrei

Or you can type it as wchar...

But important to note: that's opt in, not automatic.

On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
> On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
>> It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.
>
> Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.

It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.

>> Codepages and incompatible encodings were terrible then, too.
>>
>> Never again.
>
> This only shows you probably don't know the difference between an encoding and a code page,

"I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_"

Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)

>>> Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
>>
>> It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
>
> I think we can do a lot better.

Maybe. But no one's done it yet.

> The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.

I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.

> largely ignoring the possibilities of the header scheme I suggested.

"Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.

> I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.

It's not trolling to call you out for clearly not doing your homework.

> I don't think you understand: _you_ are the special case.

Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...? Yeah, it sounds funny to me, too.

> The 5 billion people outside the US and EU are _not the special case_.

Fortunately, it works for them too.

> The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient.

All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints.

I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time.

-Wyatt

On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 01:35 PM, ZombineDev wrote:
>> On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:
>>> On 05/31/2016 02:46 PM, Timon Gehr wrote:
>>>> On 31.05.2016 20:30, Andrei Alexandrescu wrote:
>>>>> D's
>>>>
>>>> Phobos'
>>>
>>> foreach, too. -- Andrei
>>
>> Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
>
> Try typing the iteration variable with "dchar". -- Andrei

I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same as using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings.
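
The difference between walking a string as 8-bit code units, 16-bit code units, or whole code points (what char, wchar and dchar stand for in D) can be illustrated outside D as well. The sketch below is Python, not D, and says nothing about how D's foreach is implemented; the helper names `utf8_units`, `utf16_units` and `code_points` are made up for illustration.

```python
import struct

# 'h', 'é' (U+00E9), and an emoji outside the Basic Multilingual Plane.
text = "h\u00e9\U0001F600"


def utf8_units(s):
    """8-bit code units: what iterating byte by byte would see."""
    return list(s.encode("utf-8"))


def utf16_units(s):
    """16-bit code units, including surrogate pairs for non-BMP characters."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def code_points(s):
    """Fully decoded code points: one integer per Unicode scalar value."""
    return [ord(c) for c in s]


print(len(utf8_units(text)), "UTF-8 code units")
print(len(utf16_units(text)), "UTF-16 code units")
print(len(code_points(text)), "code points")
```

The same three-character string yields three different element counts depending on the unit chosen, which is why it matters whether that choice is explicit or made silently by a library.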
