November 30, 2017
On Thursday, 30 November 2017 at 10:19:18 UTC, Walter Bright wrote:
> On 11/27/2017 7:01 PM, A Guy With an Opinion wrote:
>> +- Unicode support is good. Although I think D's string type should have probably been utf16 by default. Especially considering the utf module states:
>> 
>> "UTF character support is restricted to '\u0000' <= character <= '\U0010FFFF'."
>> 
>> Seems like the natural fit for me. Plus for the vast majority of use cases I am pretty guaranteed a char = codepoint. Not the biggest issue in the world and maybe I'm just being overly critical here.
>
> Sooner or later your code will exhibit bugs if it assumes that char==codepoint with UTF16, because of surrogate pairs.
>
> https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
>
> As far as I can tell, pretty much the only users of UTF16 are Windows programs. Everyone else uses UTF8 or UCS32.
>
> I recommend using UTF8.

Java, .NET, Qt, JavaScript, and a handful of others use UTF-16 too, some having started off with the earlier UCS-2:

https://en.m.wikipedia.org/wiki/UTF-16#Usage

Not saying either is better, each has its flaws, just pointing out it's more than just Windows.
November 30, 2017
On Thursday, 30 November 2017 at 10:19:18 UTC, Walter Bright wrote:
> On 11/27/2017 7:01 PM, A Guy With an Opinion wrote:
>> [...]
>
> Sooner or later your code will exhibit bugs if it assumes that char==codepoint with UTF16, because of surrogate pairs.
>
> https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
>
> As far as I can tell, pretty much the only users of UTF16 are Windows programs. Everyone else uses UTF8 or UCS32.
>
> I recommend using UTF8.

I assume you meant UTF32 not UCS32, given UCS2 is Microsoft's half-assed UTF16.
November 30, 2017
On 11/30/2017 2:39 AM, Joakim wrote:
> Java, .NET, Qt, JavaScript, and a handful of others use UTF-16 too, some having started off with the earlier UCS-2:
> 
> https://en.m.wikipedia.org/wiki/UTF-16#Usage
> 
> Not saying either is better, each has its flaws, just pointing out it's more than just Windows.

I stand corrected.
November 30, 2017
On 11/30/2017 2:47 AM, Nicholas Wilson wrote:
>> As far as I can tell, pretty much the only users of UTF16 are Windows programs. Everyone else uses UTF8 or UCS32.
> I assume you meant UTF32 not UCS32, given UCS2 is Microsoft's half-assed UTF16.

I meant UCS-4, which is identical to UTF-32. It's hard keeping all that stuff straight. Sigh.

https://en.wikipedia.org/wiki/UTF-32
November 30, 2017
On Thursday, 30 November 2017 at 10:19:18 UTC, Walter Bright wrote:
> On 11/27/2017 7:01 PM, A Guy With an Opinion wrote:
>> +- Unicode support is good. Although I think D's string type should have probably been utf16 by default. Especially considering the utf module states:
>> 
>> "UTF character support is restricted to '\u0000' <= character <= '\U0010FFFF'."
>> 
>> Seems like the natural fit for me. Plus for the vast majority of use cases I am pretty guaranteed a char = codepoint. Not the biggest issue in the world and maybe I'm just being overly critical here.
>
> Sooner or later your code will exhibit bugs if it assumes that char==codepoint with UTF16, because of surrogate pairs.
>
> https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
>
> As far as I can tell, pretty much the only users of UTF16 are Windows programs. Everyone else uses UTF8 or UCS32.
>
> I recommend using UTF8.

As long as you understand its limitations I think most bugs can be avoided. Where UTF16 breaks down is pretty well defined. It's also super rare. I think UTF32 would be great too, but it seems like just a waste of space 99% of the time. UTF8 isn't horrible, and I am not going to avoid D because it uses UTF8 (that would be silly), especially when wstring also seems baked into the language. However, it can complicate code because you pretty much always have to assume character != codepoint outside of ASCII. I can see a reasonable person arguing that forcing you to assume character != code point is actually a good thing. And that is a valid opinion.
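
To make the "character != codepoint" point concrete, here is a minimal sketch, assuming Phobos' `std.range.walkLength` and its auto-decoding of narrow strings. The helper `countCodePoints` is a made-up name for illustration, not anything from the standard library.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical helper: counts code points by walking the string.
// For string and wstring, walkLength decodes as it goes, so it
// counts code points rather than code units.
size_t countCodePoints(S)(S s)
{
    return s.walkLength;
}

void main()
{
    // 'a', U+00E9 (e with acute accent), U+1F600 (an emoji outside the BMP)
    string  s8  = "a\u00E9\U0001F600";
    wstring s16 = "a\u00E9\U0001F600"w;
    dstring s32 = "a\u00E9\U0001F600"d;

    // .length is the number of code units, which depends on the encoding.
    writeln(s8.length, " ", s16.length, " ", s32.length); // prints "7 4 3"

    // The number of code points is the same in all three encodings.
    writeln(countCodePoints(s8), " ", countCodePoints(s16), " ",
            countCodePoints(s32)); // prints "3 3 3"
}
```

Only the UTF-32 version has length equal to the number of code points; UTF-16 is off by one here because of the single character outside the Basic Multilingual Plane.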
November 30, 2017
On Thursday, 30 November 2017 at 11:41:09 UTC, Walter Bright wrote:
> On 11/30/2017 2:47 AM, Nicholas Wilson wrote:
>>> As far as I can tell, pretty much the only users of UTF16 are Windows programs. Everyone else uses UTF8 or UCS32.
>> I assume you meant UTF32 not UCS32, given UCS2 is Microsoft's half-assed UTF16.
>
> I meant UCS-4, which is identical to UTF-32. It's hard keeping all that stuff straight. Sigh.
>
> https://en.wikipedia.org/wiki/UTF-32

It's also worth mentioning that the more I think about it, the UTF8 vs. UTF16 thing was probably not worth mentioning with the rest of the things I listed out. It's pretty minor and more of a preference.
November 30, 2017
On Tuesday, 28 November 2017 at 16:14:52 UTC, Jack Stouffer wrote:

> you can apply attributes to your whole project by adding them to main
>
> void main(string[] args) @safe {}
>
> Although this isn't recommended, as almost no program can be completely safe.

In fact I believe it is. When you have something unsafe you can manually wrap it with @trusted. Same goes with nothrow, since you can catch everything thrown.

But putting @nogc on main is of course not recommended except in special cases, and pure is completely out of the question.
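
A rough sketch of what that can look like, assuming a recent compiler and Phobos. The helpers `firstElement` and `parseOr` are hypothetical names invented for this example. Note that marking main as @safe does not change the functions it calls; it only means the compiler rejects calls to anything that is not @safe or @trusted.

```d
import std.conv : to;

// Hypothetical helper: reads through a raw pointer, which @safe code
// may not do. @trusted means the programmer vouches for it manually.
int firstElement(const int[] arr) @trusted pure nothrow @nogc
{
    assert(arr.length > 0);
    return *arr.ptr; // using .ptr is rejected in @safe code
}

// Hypothetical helper: to!int can throw, but catching Exception here
// lets the function itself be marked nothrow.
int parseOr(string s, int fallback) @safe nothrow
{
    try
        return s.to!int;
    catch (Exception)
        return fallback;
}

void main() @safe nothrow
{
    int[] data = [10, 20, 30];
    assert(firstElement(data) == 10);
    assert(parseOr("42", -1) == 42);
    assert(parseOr("not a number", -1) == -1);
}
```

The array literal in main allocates with the GC, which is why the same main could not also be marked @nogc without further changes.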
November 30, 2017
On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki cattermole wrote:
> Be aware Microsoft is alone in thinking that UTF-16 was awesome. Everybody else standardized on UTF-8 for Unicode.

UCS2 was awesome. UTF-16 is used by Java, JavaScript, Objective-C, Swift, Dart and Microsoft tech, which together make up 28% of the TIOBE index.
November 30, 2017
On Tuesday, 28 November 2017 at 03:01:33 UTC, A Guy With an Opinion wrote:
> - Attributes. I had another post in the Learn forum about attributes which was unfortunate. At first I was excited because it seems like on the surface it would help me write better code, but it gets a little tedious and tiresome to have to remember to decorate code with them.

Then do it the C# way. There's choice.

> I think the better decision would be to not have the errors occur.

Hehe, I'm not against living in an ideal world either.

> - Immutable. I'm not sure I fully understand it. On the surface it seemed like const but transitive. I tried having a method return an immutable value, but when I used it in my unit test I got some weird errors about objects not being able to return immutable (I forget the exact error...apologies).

That's the point of a static type system: if you make a mistake, the code doesn't compile.

> +- Unicode support is good. Although I think D's string type should have probably been utf16 by default. Especially considering the utf module states:
>
> "UTF character support is restricted to '\u0000' <= character <= '\U0010FFFF'."
>
> Seems like the natural fit for me.

A single UTF-16 code unit is inadequate for the range '\u0000' <= character <= '\U0010FFFF', though. UCS2 was adequate (for '\u0000' <= character <= '\uFFFF'), but lost relevance. UTF-16 exists only for backward compatibility with early adopters of Unicode that were based on UCS2.
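
As an illustration of where one UTF-16 code unit stops being enough, here is a small sketch assuming the `std.utf.encode` overload that writes into a `wchar[2]` buffer. The helper `utf16Units` is a made-up name for this example.

```d
import std.utf : encode;

// Hypothetical helper: how many UTF-16 code units one code point needs.
size_t utf16Units(dchar c) @safe pure
{
    wchar[2] buf;
    return encode(buf, c);
}

void main() @safe
{
    wchar[2] buf;

    // U+00E9 is inside the BMP, so it fits in one 16-bit code unit.
    dchar small = '\u00E9';
    assert(encode(buf, small) == 1);

    // U+1F600 is above U+FFFF, so it becomes a surrogate pair:
    //   v    = 0x1F600 - 0x10000 = 0xF600
    //   high = 0xD800 + (v >> 10)   = 0xD83D
    //   low  = 0xDC00 + (v & 0x3FF) = 0xDE00
    dchar big = '\U0001F600';
    assert(encode(buf, big) == 2);
    assert(buf[0] == 0xD83D && buf[1] == 0xDE00);
}
```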

> Plus for the vast majority of use cases I am pretty guaranteed a char = codepoint.

That way only end users will be able to catch bugs, in a production system. It's not the best strategy, is it? Text is often persistent data, so how do you plan to fix a text handling bug when corruption has accumulated for years and spilled all over the place?
November 30, 2017
On Thursday, November 30, 2017 13:18:37 A Guy With a Question via Digitalmars-d wrote:
> As long as you understand its limitations I think most bugs can be avoided. Where UTF16 breaks down is pretty well defined. It's also super rare. I think UTF32 would be great too, but it seems like just a waste of space 99% of the time. UTF8 isn't horrible, and I am not going to avoid D because it uses UTF8 (that would be silly), especially when wstring also seems baked into the language. However, it can complicate code because you pretty much always have to assume character != codepoint outside of ASCII. I can see a reasonable person arguing that forcing you to assume character != code point is actually a good thing. And that is a valid opinion.

The reality of the matter is that if you want to write fully valid Unicode, then you have to understand the differences between code units, code points, and graphemes, and since it really doesn't make sense to operate at the grapheme level for everything (it would be terribly slow and is completely unnecessary for many algorithms), you pretty much have to come to accept that in the general case, you can't assume that something like a char represents an actual character, regardless of its encoding. UTF-8 vs UTF-16 doesn't change anything in that respect except for the fact that there are more characters which fit fully in a UTF-16 code unit than a UTF-8 code unit, so it's easier to think that you're correctly handling Unicode when you actually aren't. And if you're not dealing with Asian languages, UTF-16 uses up more space than UTF-8. But either way, they're both wrong if you're trying to treat a code unit as a code point, let alone a grapheme. It's just that we have a lot of programmers who only deal with English and thus don't as easily hit the cases where their code is wrong. For better or worse, UTF-16 hides it better than UTF-8, but the problem exists in both.
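
For anyone who wants to see the three levels side by side, here is a minimal sketch, assuming Phobos' `std.uni.byGrapheme` and `std.range.walkLength`. The helper `countGraphemes` is a hypothetical name used only for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts user-perceived characters (graphemes).
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent): rendered as one character.
    string s = "e\u0301";

    assert(s.length == 3);          // UTF-8 code units: 1 for 'e', 2 for U+0301
    assert(s.walkLength == 2);      // code points
    assert(countGraphemes(s) == 1); // graphemes

    // The UTF-16 version still needs two code units for one grapheme,
    // even though no surrogate pair is involved.
    wstring w = "e\u0301"w;
    assert(w.length == 2);
}
```

The last assertion is the case that switching to UTF-16 does not help with: both code points are in the BMP, yet a code unit is still not a character.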

- Jonathan M Davis
