October 16, 2013 Re: Inconsitency | ||||
---|---|---|---|---|
| ||||
Posted in reply to qznc | 16-Oct-2013 23:42, qznc wrote:
> On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
>> On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
>>> On 2013-10-16 14:33, qznc wrote:
>>>
>>>> It is either [U+00E4] as one code point or [a,U+0308] for two code
>>>> points. The second is "combining diaeresis" [0]. Not required, but
>>>> possible. Those combining characters [1] provide a nearly infinite
>>>> number of combinations. You can go crazy with it:
>>>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>>>
>>>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>>>> [1] http://en.wikipedia.org/wiki/Combining_character
>>>
>>> Aha, now I see.
>>
>> One of the interesting points is that with "ba\u00E4r" vs "baa\u0308r",
>> you can run a replace to replace 'a' with 'o'. Then, you'll get:
>> "boär" vs "boör"
>>
>> Which is the correct behavior? There is no correct answer.
>>
>> So while a grapheme should never be separated from its "letter" (e.g.,
>> sorting "oäa" should *not* generate "aaö". What it *should* generate
>> is up to debate), you can't entirely consider that a letter+grapheme
>> is a single entity.
>>
>> Long story short: unicode is f***ing complicated.
>>
>> And I think D does a *damn* fine job of supporting it. In particular,
>> it does an awesome job of *teaching* the coder *what* unicode is.
>> Virtually everyone here has solid knowledge of unicode (I feel). They
>> understand it, can explain it, and can work with it.
>>
>> On the other hand, I don't know many C++ coders that understand unicode.
>
> I agree with your point. Nevertheless your understanding of grapheme is
> off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the
> same grapheme as "a\u0308".

s/the same/canonically equivalent/ :)

>
> http://en.wikipedia.org/wiki/Grapheme

--
Dmitry Olshansky
|
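The replace example and the "canonically equivalent" correction above can be checked in a few lines. This is only a sketch, written in Python with the standard `unicodedata` module rather than in D; the variable names are invented for the illustration.

```python
import unicodedata

# Two spellings of the same visible text: precomposed U+00E4 versus
# 'a' followed by the combining diaeresis U+0308.
precomposed = "ba\u00E4r"
decomposed = "baa\u0308r"

# The code point sequences differ, so a plain comparison reports "not equal" ...
same_code_points = precomposed == decomposed

# ... but after normalizing both to NFC they compare equal, which is what
# "canonically equivalent" means in practice.
canonically_equivalent = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# A naive code-point-level replace of 'a' with 'o' treats the two spellings
# differently: in the decomposed form the diaeresis ends up on an 'o'.
replaced_precomposed = precomposed.replace("a", "o")  # b, o, U+00E4, r
replaced_decomposed = decomposed.replace("a", "o")    # b, o, o, U+0308, r

print(same_code_points, canonically_equivalent)
print(unicodedata.normalize("NFC", replaced_precomposed))
print(unicodedata.normalize("NFC", replaced_decomposed))
```

Whether the second result is a bug or a feature is exactly the open question raised in the quoted post; the sketch only shows that the two inputs diverge.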
October 16, 2013 Re: Inconsitency | ||||
---|---|---|---|---|
| ||||
Posted in reply to qznc | On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote:
> I agree with your point. Nevertheless your understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308".
>
> http://en.wikipedia.org/wiki/Grapheme
Ah. Learn something new every day. :)
|
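To make the code point versus grapheme distinction concrete, here is a rough sketch in Python. The helper `count_clusters` is hypothetical and deliberately simplified: it only folds characters with a non-zero canonical combining class into the preceding character, which is far less than the full grapheme cluster rules of Unicode, but it is enough for the "a\u0308" case discussed here.

```python
import unicodedata


def count_clusters(text):
    """Approximate the number of user-perceived characters in text.

    Simplifying assumption: a character starts a new cluster unless it is
    a combining mark (non-zero canonical combining class). Real grapheme
    segmentation has more rules than this.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = "\u00E4"   # one code point
decomposed = "a\u0308"   # two code points: 'a' + combining diaeresis

print(len(precomposed), count_clusters(precomposed))
print(len(decomposed), count_clusters(decomposed))
```

Both spellings count as one cluster under this approximation, even though their code point lengths differ.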
October 20, 2013 Re: Inconsitency | ||||
---|---|---|---|---|
| ||||
Posted in reply to qznc | On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
> Most code might be buggy then.
All code is buggy.
> An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
And on Windows it's case-insensitive - 2^^N variants of each string. So what?
|
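One common way to cope with both points in this exchange is to normalize names before comparing them. The sketch below is in Python and assumes a hypothetical helper, `same_file_name`; it is not taken from any file system API, and the choice of NFC plus `casefold` is one reasonable option rather than what any particular operating system does.

```python
import unicodedata


def same_file_name(a, b, case_insensitive=False):
    """Hypothetical helper: compare two names after NFC normalization.

    With case_insensitive=True the names are additionally case-folded,
    a rough stand-in for a case-insensitive file system.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if case_insensitive:
        a, b = a.casefold(), b.casefold()
    return a == b


one_code_point = "b\u00E4r"     # the form the quoted post attributes to Linux
two_code_points = "ba\u0308r"   # the form the quoted post attributes to OS X

print(one_code_point == two_code_points)
print(same_file_name(one_code_point, two_code_points))
print(same_file_name("b\u00E4r", "BA\u0308R", case_insensitive=True))
```

A raw comparison sees two different names, while the normalized comparison treats them as the same file name.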