December 02, 2017
On 12/1/2017 8:08 PM, Jonathan M Davis wrote:
> And personally, I think that their worst decisions tend to be at the code
> point level (e.g. having the same character being representable by different
> combinations of code points).

Yup. I've presented that point of view a couple times on HackerNews, and some Unicode people took umbrage at that. The case they presented fell a little flat.


> Quite possibly the most depressing thing that I've run into with Unicode
> though was finding out that emojis had their own code points. Emojis are
> specifically representable by a sequence of existing characters (usually
> ASCII), because they came from folks trying to represent pictures with text.
> The fact that they're then trying to put those pictures into the Unicode
> standard just blatantly shows that the Unicode folks have lost sight of what
> they're up to. It's like if they started trying to add Unicode characters
> for words. It makes no sense. But unfortunately, we just have to live with
> it... :(

Yah, I've argued against that, too. And those "international" icons are arguably one of the dumber ideas to ever sweep the world, yet they seem to be celebrated without question.

Have you ever tried to look up an icon in a dictionary? It doesn't work. So if you don't know what an icon means, you're hosed. If it is a word you don't understand, you can look it up in a dictionary.

Furthermore, you don't need to know English to know what "ON" means. There is no more cognitive difficulty asking someone what "ON" means than there is asking what "|" means. Is an illiterate person from XxLand really going to understand that "|" means "ON" without help?

My car has a bunch of emoticons labeling the controls. I can't figure out what any of them do without reading the manual, or just pushing random buttons until what I want happens. One button has an icon on it that looks like a snowflake. What does that do? Turn on the A/C? Defrost the frosty windows? Set the AWD in slippery mode? Turn on the Christmas lights?

On my pre-madness truck, they're labeled in English. Never had any trouble with that.

Part of the problem I've seen is that people do things like "vote for my emoji/icon and I'll vote for yours!" And then when they get something accepted, they wear it as a badge of status and write articles saying how you, too, can get your whatever accepted as an icon. It's madness, madness I say!
December 02, 2017
On 2017-12-02 11:02, Walter Bright wrote:

> Are you sure about that? I know that Asian languages will be longer in UTF-8. But how much data that programs handle is in those languages? The language of business, science, programming, aviation, and engineering is english.

Not necessarily. I've seen code in non-English languages, i.e. when the identifiers are non-English. But of course, most programming languages will use English for keywords and built-in functions.

-- 
/Jacob Carlborg
December 02, 2017
On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
> On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via Digitalmars-d wrote:
>> On 11/30/2017 9:23 AM, Kagamin wrote:
>> > On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki cattermole wrote:
>> > > Be aware Microsoft is alone in thinking that UTF-16 was awesome. Everybody else standardized on UTF-8 for Unicode.
>> > 
>> > UCS2 was awesome. UTF-16 is used by Java, JavaScript, Objective-C, Swift, Dart and ms tech, which is 28% of tiobe index.
>> 
>> "was" :-) Those are pretty much pre-surrogate pair designs, or based
>> on them (Dart compiles to JavaScript, for example).
>> 
>> UCS2 has serious problems:
>> 
>> 1. Most strings are in ascii, meaning UCS2 doubles memory consumption. Strings in the executable file are twice the size.
>
> This is not true in Asia, esp. where the CJK block is extensively used. A CJK block character is 3 bytes in UTF-8, meaning that string sizes are 150% of the UCS2 encoding.  If your code contains a lot of CJK text, that's a lot of bloat.

That's true in theory, but in practice it's not that severe, as the CJK languages are never isolated and appear embedded in a lot of ASCII. You can read a case study here [1] which shows UTF-8 at 106% of the UTF-16 size for Simplified Chinese, 76% for Traditional Chinese, 129% for Japanese and 94% for Korean. These numbers are for pure text. Publish it on the web embedded in bloated HTML and there goes the size advantage of UTF-16.
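To make the size argument concrete, here is a rough sketch that compares the encoded size of the same text in UTF-8 and UTF-16. The sample strings and the helper name `encoded_sizes` are made up for illustration and are not taken from the case study in [1].

```python
# Illustrative only: the sample strings are invented, not from the cited case study.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Three CJK characters from the Basic Multilingual Plane:
# 3 bytes each in UTF-8, 2 bytes each in UTF-16.
pure_cjk = "\u7d71\u4e00\u78bc"

# The same three characters wrapped in ASCII markup:
# every ASCII character is 1 byte in UTF-8 but 2 bytes in UTF-16.
markup = '<p class="title">' + pure_cjk + "</p>"

for label, sample in (("pure CJK", pure_cjk), ("CJK in markup", markup)):
    utf8_len, utf16_len = encoded_sizes(sample)
    print(label, utf8_len, utf16_len, round(100 * utf8_len / utf16_len), "%")
```

The pure CJK string comes out at 150% in UTF-8, while the version wrapped in ASCII markup ends up well under 100%, which is the effect described above.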



>
> But then again, in non-Latin locales you'd generally store your strings separately of the executable (usually in l10n files), so this may not be that big an issue. But the blanket statement "Most strings are in ASCII" is not correct.
>
False, in the sense that isolated pure text is rare and is generally delivered inside some file format, most often an ASCII-based one like docx, odf, tmx, xliff, akoma ntoso etc...

[1]: https://stackoverflow.com/questions/6883434/at-all-times-text-encoded-in-utf-8-will-never-give-us-more-than-a-50-file-size

December 02, 2017
On Saturday, 2 December 2017 at 10:35:50 UTC, Patrick Schluter wrote:
> On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
>> [...]
>
> That's true in theory, but in practice it's not that severe, as the CJK languages are never isolated and appear embedded in a lot of ASCII. You can read a case study here [1] which shows UTF-8 at 106% of the UTF-16 size for Simplified Chinese, 76% for Traditional Chinese, 129% for Japanese and 94% for Korean. These numbers are for pure text.

106% for Korean; I copied the wrong column. Traditional Chinese was smaller, probably because of whitespace.

> Publish it on the web embedded in bloated html and there goes the size advantage of UTF-16
>
> [...]

December 02, 2017
On Saturday, 2 December 2017 at 04:08:54 UTC, Jonathan M Davis wrote:
>
> The fact that they're then trying to put those pictures into the Unicode standard just blatantly shows that the Unicode folks have lost sight of what they're up to. It's like if they started trying to add Unicode characters for words. It makes no sense. But unfortunately, we just have to live with it... :(
>
> - Jonathan M Davis

The real problem is that sometimes people don't feel like a little cat with a smiling face. Sometimes, people actually get pissed off at something, and would like to express it.

Do the people on the unicode consortium consider such communication to be invalid?

Where are the emojis for saying "I'm pissed off at this... or that..."?

(unicode consortium == emoji censorship)

https://www.google.com.au/search?q=fuck+you+emoticon&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiWkMzMpOvXAhWIj5QKHVnGC5YQ_AUICigB&biw=1536&bih=736

December 02, 2017
On Saturday, 2 December 2017 at 04:08:54 UTC, Jonathan M Davis wrote:
> code points. Emojis are specifically representable by a sequence of existing characters (usually ASCII), because they came from folks trying to represent pictures with text.

They are used as symbols culturally, which is how written language happens, so I think the real question is whether they have just implemented the ones that have become widespread over a long period of time or whether they have deliberately created completely new ones... It makes sense for the most used ones.

E.g. I don't want "8-(3+4)" to render as "😳3+4" ;-)

There is also a difference between Ø and ∅, because the meaning is different. Too bad the same does not apply to arrows (math vs non math usage).

So yeah, they could do better, but not too bad. If something is widely used in a way that gives signs a different meaning then it makes sense to introduce a new symbol for it so that one both can render them slightly differently and so that the programs can interpret them correctly.



December 02, 2017
On Saturday, 2 December 2017 at 10:20:10 UTC, Walter Bright wrote:
> On 12/1/2017 8:08 PM, Jonathan M Davis wrote:
>> [...]
>
> Yup. I've presented that point of view a couple times on HackerNews, and some Unicode people took umbrage at that. The case they presented fell a little flat.
>
> [...]

Where it gets really fun is when there is color composition for emoticons:
U+1F466 = 👦
U+1F466 U+1F3FF = 👦🏿
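
As a small sketch of what that composition costs at each level, the snippet below builds the same sequence and counts code points, UTF-16 code units and UTF-8 bytes. The helper names `code_points` and `utf16_units` are hypothetical, chosen only for this example.

```python
# Sketch: one visible emoji, built from two code points outside the BMP.

boy = "\U0001F466"             # U+1F466 BOY
dark_skin_tone = "\U0001F3FF"  # U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
combined = boy + dark_skin_tone

def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]

def utf16_units(text):
    """Count 16-bit code units; each non-BMP code point needs a surrogate pair."""
    return len(text.encode("utf-16-le")) // 2

print(code_points(combined))          # the two code points listed above
print(len(combined))                  # number of code points
print(utf16_units(combined))          # number of UTF-16 code units
print(len(combined.encode("utf-8")))  # number of UTF-8 bytes
```

So a renderer that supports the modifier shows one glyph, but it is two code points, four UTF-16 code units and eight UTF-8 bytes.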
December 02, 2017
On Saturday, 2 December 2017 at 12:25:22 UTC, codephantom wrote:
> Do the people on the unicode consortium consider such communication to be invalid?

https://splinternews.com/violent-emoji-are-starting-to-get-people-in-trouble-wit-1793845130

On the other hand try to google "emoji sexual"…

December 02, 2017
On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
> On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via Digitalmars-d wrote:
>> On 11/30/2017 9:23 AM, Kagamin wrote:
>> > On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki cattermole wrote:
>> > > Be aware Microsoft is alone in thinking that UTF-16 was awesome. Everybody else standardized on UTF-8 for Unicode.
>> > 
>> > UCS2 was awesome. UTF-16 is used by Java, JavaScript, Objective-C, Swift, Dart and ms tech, which is 28% of tiobe index.
>> 
>> "was" :-) Those are pretty much pre-surrogate pair designs, or based
>> on them (Dart compiles to JavaScript, for example).
>> 
>> UCS2 has serious problems:
>> 
>> 1. Most strings are in ascii, meaning UCS2 doubles memory consumption. Strings in the executable file are twice the size.
>
> This is not true in Asia, esp. where the CJK block is extensively used. A CJK block character is 3 bytes in UTF-8, meaning that string sizes are 150% of the UCS2 encoding.  If your code contains a lot of CJK text, that's a lot of bloat.

Yep, that's why five years back many of the major Chinese sites were still not using UTF-8:

http://xahlee.info/w/what_encoding_do_chinese_websites_use.html

That led that Chinese guy to also rant against UTF-8 a couple years ago:

http://xahlee.info/comp/unicode_utf8_encoding_propaganda.html

Considering China buys more smartphones than the US and Europe combined, it's time people started recognizing their importance when it comes to issues like this:

https://www.statista.com/statistics/412108/global-smartphone-shipments-global-region/

Regarding the unique representation issue Jonathan brings up, I've heard people say that was to provide an easier path for legacy encodings, i.e. some used combining characters and others didn't, so Unicode chose to accommodate both so that both groups would move to Unicode.  It would be nice if the Unicode people spent their time pruning and regularizing what they have, rather than adding more useless stuff.
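
For anyone who hasn't run into the representation issue, here is a minimal sketch using Python's standard `unicodedata` module. The choice of "é" as the sample character and the helper name `same_text` are just for illustration.

```python
import unicodedata

# Two ways to write the same visible character "é":
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

def same_text(a, b):
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)           # raw comparison of code points
print(same_text(precomposed, decomposed))  # comparison after normalization
print(len(precomposed), len(decomposed))   # code point counts differ
```

The raw comparison fails even though both strings display identically, so any code that compares, hashes or searches text has to normalize first.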

Speaking of which, I completely agree with Walter and Jonathan that there's no need for emoji and other such symbols in Unicode; they should never have been added.  Unicode is supposed to standardize long-existing characters, not promote marginal new symbols to characters.  If there's a real need for it, chat software will figure out a way to do it; there's no need to add such symbols to the Unicode character set.
December 02, 2017
On 11/30/2017 10:07 PM, Patrick Schluter wrote:
> endianness

Yeah, I forgot to mention that one. As if anyone remembers to put in the Byte Order Mark :-(
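
A short sketch of why the missing BOM hurts, using Python's standard codecs. The function `decode_utf16` and its little-endian default are assumptions made for this example, not a recommendation from any standard.

```python
import codecs

def decode_utf16(data, default="utf-16-le"):
    """Decode UTF-16 bytes, using the BOM when present.

    Without a BOM the byte order has to be guessed; falling back to
    little-endian here is an arbitrary assumption for this sketch.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return data.decode(default)

little = "A".encode("utf-16-le")   # bytes 41 00, no BOM
big = "A".encode("utf-16-be")      # bytes 00 41, no BOM

print(little, big)
print(repr(little.decode("utf-16-be")))            # wrong order: a different character
print(decode_utf16(codecs.BOM_UTF16_BE + big))     # BOM present: decoded correctly
```

Decoding with the wrong byte order does not raise an error here; it silently yields a different character, which is what makes the problem easy to miss.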