Unicode handling comparison (page 2)

On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:
> The author also doesn't seem to understand the Unicode definitions of character and grapheme, which is a shame, because the difference is more or less the whole point of the post.

I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything. If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (more examples up front of more things that you'd actually use std.uni for). Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?). On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.

> On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).

Yes, please. While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings. It honestly surprised me how many things in std.uni don't seem to work on ranges.

-Wyatt
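The `byGrapheme` / `byCodePoint` idea from the quoted post can be sketched as a pair of lazy generators. This is an illustration in Python, not D, and `by_grapheme` and `by_code_point` are hypothetical names rather than Phobos functions. It is also a deliberate simplification: it only attaches combining marks (nonzero canonical combining class) to the preceding code point and ignores the rest of Unicode's grapheme cluster rules (Hangul syllables, joiner sequences and so on).

```python
import unicodedata
from typing import Iterable, Iterator


def by_grapheme(code_points: Iterable[str]) -> Iterator[str]:
    """Lazily group code points into approximate grapheme clusters.

    Simplifying assumption: a cluster is one code point followed by any
    number of combining marks (canonical combining class != 0).
    """
    cluster = ""
    for cp in code_points:
        if cluster and unicodedata.combining(cp) == 0:
            # A non-combining code point starts the next cluster.
            yield cluster
            cluster = cp
        else:
            cluster += cp
    if cluster:
        yield cluster


def by_code_point(graphemes: Iterable[str]) -> Iterator[str]:
    """Converse transformation: flatten clusters back into code points."""
    for grapheme in graphemes:
        yield from grapheme


text = "noe\u0308l"
clusters = list(by_grapheme(text))
print(len(text), len(clusters))  # 5 code points, 4 clusters
```

Because both functions are generators, nothing is decoded until the caller iterates, which is roughly the behaviour one would want from a lazy range.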

On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
> trouble following all that (e.g. Isn't "noe\u0308l" a grapheme

Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it."

Is that about right?

-Wyatt

On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
> Seems like a pretty big "gotcha" from a usability standpoint; it's not exactly intuitive. I understand WHY this decision was made, but it feels like a source of code smell and weird string comparison errors.

It probably is, but it is a Unicode gotcha, not a D one.

On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
> Through Reddit I have seen this small comparison of Unicode handling between different programming languages:
>
> http://mortoray.com/2013/11/27/the-string-type-is-broken/
>
> D+Phobos seem to fail most things (it produces BAFFLE):
> http://dpaste.dzfl.pl/a5268c435
>
> Bye,
> bearophile

Ha, I was just discussing that here: http://forum.dlang.org/thread/xmusisihhbmefeigvxvd@forum.dlang.org

On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote:
> Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it."
>
> Is that about right?
>
> -Wyatt

Yes. A grapheme is also sometimes explained as being the unit that lay people intuitively think of as being a "character".

The difference between a grapheme and a grapheme cluster is just a matter of perspective, like the difference between a character and a code point; the former simply refers to the decoded result, while the latter refers to the sum of encoding parts (where the parts are code points for grapheme cluster, and code units for a code point). Yet another example is that of the UTF-32 code unit: one UTF-32 code unit is (currently) equal to one Unicode code point, but both terms are meaningful in the right context.
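To make the terminology concrete, here is a small sketch that counts the same text at the code unit and code point levels. It is written in Python purely for illustration and uses no D or Phobos API; the sample string is the "noe\u0308l" example from earlier in the thread.

```python
text = "noe\u0308l"  # 'e' followed by U+0308 COMBINING DIAERESIS

# Code points: a Python str is a sequence of code points.
code_points = [f"U+{ord(c):04X}" for c in text]

# Code units: the count depends on the encoding.
utf8_units = len(text.encode("utf-8"))            # 1 byte per code unit
utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per code unit
utf32_units = len(text.encode("utf-32-le")) // 4  # 4 bytes per code unit

print(code_points)
print(utf8_units, utf16_units, utf32_units)  # 6 5 5
```

A reader sees four characters ("noël"), yet the text has five code points and six UTF-8 code units. The UTF-32 count matches the code point count, as noted in the post above.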

November 27, 2013

Re: Unicode handling comparison

Posted by Jakob Ovrum
in reply to Wyatt

On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
> I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything.  If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (more examples up front of more things that you'd actually use std.uni for).  Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?).  On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.

I thought it was nice that std.uni had a proper terminology section, complete with links to Unicode documents to kick-start beginners to Unicode. It mentions its relationship with std.utf right at the top.

Maybe the first paragraph is just too thin, and it's hard to see the big picture. Maybe it should include a small leading paragraph detailing the three levels of Unicode granularity that D/Phobos chooses; arrays of code units -> ranges of code points -> std.uni for graphemes and algorithms.

> Yes, please.  While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings.  It honestly surprised me how many things in std.uni don't seem to work on ranges.
>
> -Wyatt

Most string code is Unicode-correct as long as it works on code points and all inputs are of the same normalization format; explicit grapheme-awareness is rarely a necessity. By that I mean the most common string operations, such as searching, getting a substring etc. will work without any special grapheme decoding (beyond normalization).
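The point about normalization can be illustrated with a short sketch. It is written in Python rather than D, using the standard `unicodedata` module, and the sample strings are made up for the example. It shows that comparison and searching at the code point level behave as expected once both inputs share a normalization form.

```python
import unicodedata

precomposed = "no\u00EBl"  # 'ë' as the single code point U+00EB
decomposed = "noe\u0308l"  # 'e' followed by U+0308 COMBINING DIAERESIS

# Identical to a reader, but different code point sequences.
same_before = precomposed == decomposed

# Normalize both sides to the same form (NFC here) before comparing.
a = unicodedata.normalize("NFC", precomposed)
b = unicodedata.normalize("NFC", decomposed)
same_after = a == b

# Substring search also works on code points after normalization.
haystack = unicodedata.normalize("NFC", "joyeux " + decomposed)
index = haystack.find(a)

print(same_before, same_after, index)  # False True 7
```

No grapheme decoding is involved here; agreeing on one normalization form is enough for these operations.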

The hiccups appear when code points are shuffled around, or the order is changed. Apart from these rare string manipulation cases, grapheme awareness is necessary for rendering code.
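Reversal is a typical case where reordering code points goes wrong. The sketch below is in Python, not D, and `clusters` is a hypothetical helper that only approximates grapheme clusters by attaching combining marks to the preceding code point, which is an assumption that does not cover the full Unicode segmentation rules.

```python
import unicodedata

text = "noe\u0308l"

# Naive reversal by code point: the combining diaeresis ends up after 'l'.
naive = text[::-1]


def clusters(s: str) -> list[str]:
    """Approximate grapheme clusters: a code point plus trailing combining marks."""
    out: list[str] = []
    for cp in s:
        if out and unicodedata.combining(cp):
            out[-1] += cp  # attach the mark to the previous cluster
        else:
            out.append(cp)
    return out


# Reversing whole clusters keeps the diaeresis on the 'e'.
aware = "".join(reversed(clusters(text)))

print(naive == "l\u0308eon", aware == "le\u0308on")  # True True
```

Searching or slicing at a known match position does not move code points relative to each other, which is why those operations are unaffected.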

On 2013-11-27 17:15, Wyatt wrote:
> I don't remember if it was brought up before, but this makes me wonder
> if something like an i18nString should exist for cases where it IS
> important. Making i18n stuff as simple as it looks like it "should" be
> has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?)

I think we should have that.

--
/Jacob Carlborg

On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
> I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?)
>
> -Wyatt

What would it do that std.uni doesn't already?

i18nString sounds like a range of graphemes to me. I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid.

In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.

On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg wrote:
> On 2013-11-27 18:22, Jakob Ovrum wrote:
>
>> What would it do that std.uni doesn't already?
>
> A class/struct that handles all these normalizations and other stuff automatically.

Sounds terrible :)
