Unicode handling comparison
November 27, 2013
Through Reddit I have seen this small comparison of Unicode handling between different programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

D+Phobos seem to fail most things (it produces BAFFLE):
http://dpaste.dzfl.pl/a5268c435

Bye,
bearophile
November 27, 2013
On 2013-11-27 13:46, bearophile wrote:
> Through Reddit I have seen this small comparison of Unicode handling
> between different programming languages:
>
> http://mortoray.com/2013/11/27/the-string-type-is-broken/
>
> D+Phobos seem to fail most things (it produces BAFFLE):
> http://dpaste.dzfl.pl/a5268c435

Indeed it does. Have you tried with std.uni?

--
  Simen
November 27, 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
> D+Phobos seem to fail most things (it produces BAFFLE):

I still think we're doing pretty good.

At least we *handle* Unicode at all (looking at you, C++). And we handle *true* Unicode, not BMP-style UCS (looking at you, Java/C#), with the option of storing strings in any encoding, UTF-8 through UTF-32, and the possibility of ASCII as well.

We don't yet totally handle things like diacritics or ligatures, but we are getting there.

On the whole, I find that D is incredibly "Unicode correct enough" out of the box, with no extra effort involved.
November 27, 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
> Through Reddit I have seen this small comparison of Unicode handling between different programming languages:
>
> http://mortoray.com/2013/11/27/the-string-type-is-broken/
>
> D+Phobos seem to fail most things (it produces BAFFLE):
> http://dpaste.dzfl.pl/a5268c435

If you need to perform this kind of operation on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.

As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.

David
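
A minimal sketch of the normalize suggestion above, assuming a Phobos release in which std.uni compiles cleanly. The sample word and variable names are illustrative and are not taken from the linked paste.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    // The same word written two ways: precomposed U+00EB versus
    // 'e' followed by U+0308 COMBINING DIAERESIS.
    string precomposed = "no\u00EBl";
    string decomposed  = "noe\u0308l";

    // Plain array comparison works on code units, so these differ.
    assert(precomposed != decomposed);

    // After normalizing both sides to NFC they compare equal.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));

    // normalize defaults to NFC, so the UFCS form needs no arguments.
    assert(decomposed.normalize() == precomposed);

    writeln("normalized comparison ok");
}
```

Normalization allocates when the input is not already in the requested form, which fits the cost concern raised in the post.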
November 27, 2013
On 2013-11-27 15:45, David Nadlinger wrote:

> If you need to perform this kind of operation on Unicode strings in D,
> you can call normalize (std.uni) on the string first to make sure it is
> in one of the Normalization Forms. For example, just appending
> .normalize to your strings (which defaults to NFC) would make the code
> produce the "expected" results.

That didn't work out very well:

std/uni.d(6301): Error: undefined identifier tuple

-- 
/Jacob Carlborg
November 27, 2013
On Wednesday, 27 November 2013 at 15:03:37 UTC, Jacob Carlborg wrote:
> std/uni.d(6301): Error: undefined identifier tuple

Yeah, I saw it too. The fix is simple:

https://github.com/D-Programming-Language/phobos/pull/1728

tbh this makes me think version(unittest) might just be considered harmful. I'm sure that code passed the tests, but only because a vital import was in a version(unittest) section!
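
A minimal sketch of this class of pitfall, not the actual Phobos code: `brokenPair` and `fixedPair` are hypothetical names. A template body is only checked when it is instantiated, so a helper whose import lives under version(unittest) passes the test build and fails only in user code built without -unittest.

```d
import std.stdio : writeln;

// Pitfall: this import exists only when compiling with -unittest.
version (unittest) import std.typecons : tuple;

// Hypothetical helper. Instantiating it in a build without -unittest
// fails with "undefined identifier tuple".
auto brokenPair(T)(T a, T b)
{
    return tuple(a, b);
}

// One possible fix: import at the point of use, so the dependency is
// present in every build.
auto fixedPair(T)(T a, T b)
{
    import std.typecons : tuple;
    return tuple(a, b);
}

unittest
{
    // Passes, because the version(unittest) import is active here.
    assert(brokenPair(1, 2)[0] == 1);
}

void main()
{
    // Only the fixed helper is used outside the unittest block.
    writeln(fixedPair(1, 2)[1]);
}
```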
November 27, 2013
On 2013-11-27 16:07, Adam D. Ruppe wrote:

> Yeah, I saw it too. The fix is simple:
>
> https://github.com/D-Programming-Language/phobos/pull/1728
>
> tbh this makes me think version(unittest) might just be considered
> harmful. I'm sure that code passed the tests, but only because a vital
> import was in a version(unittest) section!

You were faster. But I created an issue as well.

-- 
/Jacob Carlborg
November 27, 2013
David Nadlinger:

> If you need to perform this kind of operation on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.
>
> As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.

Thank you :-)

Bye,
bearophile
November 27, 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
> Through Reddit I have seen this small comparison of Unicode handling between different programming languages:
>
> http://mortoray.com/2013/11/27/the-string-type-is-broken/

Most of the points are good, but the author seems to confuse UCS-2 with UTF-16, so the whole point about UTF-16 is plain wrong.

The author also doesn't seem to understand the Unicode definitions of character and grapheme, which is a shame, because the difference is more or less the whole point of the post.

> D+Phobos seem to fail most things (it produces BAFFLE):
> http://dpaste.dzfl.pl/a5268c435

D strings are arrays of code units and ranges of code points. The failure here is yours, in that you didn't use std.uni to handle graphemes.

On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).
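
A minimal sketch of the three levels described above (code units, code points, graphemes). It assumes a later Phobos release in which std.uni provides `byGrapheme` and `byCodePoint`; the sample string is illustrative.

```d
import std.algorithm : equal;
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS
    string s = "noe\u0308l";

    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points
    assert(s.byGrapheme.walkLength == 4); // graphemes

    // Reversing by code point moves the combining mark onto the 'l'.
    assert(equal(s.retro, "l\u0308eon"));

    // Reversing by grapheme keeps the mark attached to its base letter.
    string reversed = s.byGrapheme.array.retro.byCodePoint.text;
    assert(reversed == "le\u0308on");
}
```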
November 27, 2013
On Wednesday, 27 November 2013 at 14:45:32 UTC, David Nadlinger wrote:
>
> If you need to perform this kind of operation on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.
>
Seems like a pretty big "gotcha" from a usability standpoint; it's not exactly intuitive.  I understand WHY this decision was made, but it feels like a source of code smell and weird string comparison errors.

> As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.
>
I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important.  Making i18n stuff as simple as it looks like it "should" be has merit, IMO.  (Maybe there's even room for a std.string.i18n submodule?)

-Wyatt
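
One way such a type might look, purely as a sketch: `NormalizedString` is a hypothetical name, not an existing or proposed Phobos type. It pays the normalization cost once at construction, so later comparisons are plain array comparisons.

```d
import std.uni : normalize, NFC;

// Hypothetical wrapper that stores its text in NFC.
struct NormalizedString
{
    private string data;

    this(string s)
    {
        // Normalize once, up front.
        data = normalize!NFC(s);
    }

    // The default memberwise equality compares the normalized data.

    string toString() const
    {
        return data;
    }
}

void main()
{
    auto a = NormalizedString("no\u00EBl");  // precomposed
    auto b = NormalizedString("noe\u0308l"); // 'e' + combining mark
    assert(a == b);
    assert(b.toString() == "no\u00EBl");
}
```

This covers canonical equivalence only; locale-aware collation or case folding would need more than normalization.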