Replacing tango.text.Ascii.isearch (page 4)

On 26/10/2022 6:06 PM, Siarhei Siamashka wrote:
> Should we ignore the `"D should strive to be correct, rather than fast"` comment from bauss for now? Or some actions can be taken to improve the current situation?

Bauss is correct. It should be implemented but it does not need to be fast.

But yeah, if you are able to ignore that Unicode is a thing, I'd recommend it. It is complicated, as we humans are very complicated ;)

On 26/10/2022 6:49 PM, Siarhei Siamashka wrote:
> On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole wrote:
>> if you are able to ignore that Unicode is a thing, I'd recommend it. It is complicated, as we humans are very complicated ;)
>
> I can't ignore Unicode, because I frequently have to deal with Cyrillic alphabet ;) Also Unicode is significantly simpler than a set of various incompatible 8-bit encodings (such as [CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were simultaneously in use earlier and caused a lot of pain. But I'm surely able to ignore the peculiarities of modern Turkish Unicode and wait for the other people to come up with a solution for D language if they really care.

Cyrillic isn't an issue.

Lithuanian, Turkish and Azeri are the ones with the biggest issues.

There are a bunch of non-simple mappings for Latin, Armenian and Greek, but they are not language dependent.

There are six conditional ones, which are all Greek.

So if you are not dealing with these languages (even if you are, a simple replace should be easy to do for most), you should be fine with the simple mappings supported by std.uni.
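As a rough illustration of the "simple replace" idea for Turkish and Azeri, here is a minimal sketch. It is written in Python only as a neutral illustration and is not what Phobos does; the helper names `lower_turkish` and `upper_turkish` are hypothetical. It assumes precomposed input (U+0130 and U+0131) and does not handle decomposed sequences such as `I` followed by U+0307.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I


def lower_turkish(text: str) -> str:
    """Lowercase text with the Turkish/Azeri pairing İ/i and I/ı."""
    # Rewrite the two language-dependent capitals first, then let the
    # default, language-neutral mapping handle every other letter.
    return (text.replace(DOTTED_CAPITAL_I, "i")
                .replace("I", DOTLESS_SMALL_I)
                .lower())


def upper_turkish(text: str) -> str:
    """Uppercase text with the Turkish/Azeri pairing i/İ and ı/I."""
    return (text.replace("i", DOTTED_CAPITAL_I)
                .replace(DOTLESS_SMALL_I, "I")
                .upper())


sample = "D\u0130YARBAKIR"             # DİYARBAKIR
default_lower = sample.lower()         # language-neutral mapping
turkish_lower = lower_turkish(sample)  # Turkish-tailored mapping

print(default_lower)
print(turkish_lower)
```

The replace has to run before the generic mapping; afterwards the information about which `i` was meant is already lost.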

On 10/25/22 22:49, Siarhei Siamashka wrote:

> Unicode is significantly simpler than a set of various
> incompatible 8-bit encodings

Strongly agreed.

> I'm surely
> able to ignore the peculiarities of modern Turkish Unicode

The problem with Unicode is its main aim of allowing characters of multiple writing systems in the same text. When multiple writing systems are in play, conflicts and ambiguities will appear.

> and wait for
> the other people to come up with a solution for D language if they
> really care.

I solved my problem by writing an Alphabet hierarchy in the past. I don't like that code but it still works:

https://bitbucket.org/acehreli/ddili/src/4c0552fe8352dfe905c9734a57d84d36ce4ed476/src/alphabet.d#lines-50

It handles capitalization, ordering, etc. I use it when preparing the Index section of the Turkish edition of "Programming in D":

http://ddili.org/ders/d/ix.html

One of the ambiguities is what came up on this thread: Should a word that starts with I (capital i) be listed under I (because it's Turkish) or under İ (because it's English)?

So far, I am lucky because the only word that starts with I happens to be the English "IDE", so it goes under i (which appears as İ in the Turkish edition) and would make sense to a Turkish reader because a Turkish reader might (really?) accept it as the capital of ide.

It's confusing but it seems to work. :) It doesn't matter. Life is imperfect and things will somehow work in the end.

Ali

October 28, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by Siarhei Siamashka
in reply to Ali Çehreli

On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:

> The problem with Unicode is its main aim of allowing characters of multiple writing systems in the same text. When multiple writing systems are in play, conflicts and ambiguities will appear.

I personally don't think that this is a problem with Unicode itself. Based on what I can see, it looks like the individuals or the committees responsible for mapping the Turkish alphabet to Unicode just made a blunder.

For example, let's compare the Latin uppercase "B" and the Cyrillic uppercase "В". They look exactly the same, right? Would it be a smart idea for them to share the same code point in the Unicode table? But wait. What happens if we convert these letters to lowercase? The Latin "B" becomes "b" and the Cyrillic "В" becomes "в". Oops! So by having different code points for the Latin uppercase "B" and the Cyrillic uppercase "В", we dodged a whole bunch of nasty problems.
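The comparison above can be checked directly. This is a small sketch in Python, used only for illustration; the variable names are arbitrary. It prints the code point and official Unicode name of each letter and of its lowercase form.

```python
import unicodedata

latin_b = "B"            # U+0042
cyrillic_ve = "\u0412"   # В, visually similar to the Latin B

for ch in (latin_b, cyrillic_ve):
    low = ch.lower()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}"
          f" -> U+{ord(low):04X} {unicodedata.name(low)}")

# The glyphs may look alike, but the characters are distinct,
# so each one can carry its own lowercase mapping.
same_character = latin_b == cyrillic_ve
```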

Another example. Patrick Schluter mentioned the Greek letter sigma, and the Wikipedia article says: "uppercase Σ, lowercase σ, lowercase in word-final position ς", which makes everything rather problematic. Now let's compare this to the Belarusian language and its letter "у". The Belarusian "у" transforms into "ў" depending on context, however this transformation doesn't happen for the first letter of proper nouns or in acronyms (and this theoretically makes the uppercase "ў" redundant). Just imagine an alternative Greek-inspired reality, where both "у" and "ў" uppercase to "У". And yet the uppercase "Ў" exists in Unicode, so luckily in our reality we don't have to deal with uppercase/lowercase round-trip failures. This is computer-friendly. And as I already mentioned in an earlier comment, the Germans have also had the uppercase "ẞ" in Unicode since 2008 (better late than never).
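To make the round-trip point concrete, here is a minimal sketch in Python (illustration only; `round_trips` is a hypothetical helper). It uppercases a single lowercase letter, lowercases the result, and checks whether the original letter comes back. Python's `str.upper()` and `str.lower()` apply the language-neutral full case mappings, which is an assumption worth keeping in mind when comparing with other libraries.

```python
def round_trips(lower_char: str) -> bool:
    """True if lower(upper(c)) gives back the original character c."""
    return lower_char.upper().lower() == lower_char


results = {
    "CYRILLIC SHORT U": round_trips("\u045E"),   # ў has its own capital Ў
    "GREEK FINAL SIGMA": round_trips("\u03C2"),  # ς shares the capital Σ with σ
    "LATIN SHARP S": round_trips("\u00DF"),      # ß uppercases to "SS"
    "LATIN DOTLESS I": round_trips("\u0131"),    # ı shares the capital I with i
}

for name, ok in results.items():
    print(f"{name}: {'round-trips' if ok else 'does not round-trip'}")
```

Only the letter with a dedicated uppercase partner survives the round trip under these default mappings.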

> I solved my problem by writing an Alphabet hierarchy in the past. I don't like that code but it still works:
>
> [...]
>
> It's confusing but it seems to work. :) It doesn't matter. Life is imperfect and things will somehow work in the end.

What's your opinion/conclusion? Is it fine the way it is? Do you think that some unique property of the Turkish language/alphabet made these difficulties unavoidable? Or do you think that it was a mistake, but now we have to live with it forever for compatibility reasons? Anything else?

And as for the D language and Phobos, should "ß" still uppercase to "SS"? Or can we change it to uppercase "ẞ" and remove German from the list of tricky languages at https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be listed there?

On 29/10/2022 11:05 AM, Siarhei Siamashka wrote:
> And as for the D language and Phobos, should "ß" still uppercase to "SS"? Or can we change it to uppercase "ẞ" and remove German from the list of tricky languages at https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be listed there?

That particular function is based upon the simple mappings provided by UnicodeData.txt and (should be) in compliance with the Unicode standard. The only thing we need to do is regenerate the tables backing it whenever Unicode updates.

Note the behavior you are asking for is defined in the Unicode database file SpecialCasing.txt, which has not been implemented.

```
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
```

That file is how you support languages like Turkish. We currently don't have it implemented. It requires operating on a whole string and passing in which language rules to apply (i.e. Turkish, Azeri).
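For readers unfamiliar with the file, the entry quoted above can be decoded with a few lines of code. This is a minimal sketch in Python, not an implementation of the casing algorithm; `parse_special_casing` is a hypothetical helper. It assumes the SpecialCasing.txt field order of code point, lowercase, titlecase, uppercase, then an optional condition list (such as a language tag), each terminated by a semicolon.

```python
def parse_special_casing(line: str) -> dict:
    """Decode one data line of SpecialCasing.txt into readable strings."""
    data = line.split("#", 1)[0]                   # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]  # last item is empty

    def to_text(hex_sequence: str) -> str:
        # "0053 0073" -> "Ss"
        return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())

    has_conditions = len(fields) > 4 and fields[4] != ""
    return {
        "char": to_text(fields[0]),
        "lower": to_text(fields[1]),
        "title": to_text(fields[2]),
        "upper": to_text(fields[3]),
        # Conditional entries name a language or context here.
        "conditions": fields[4].split() if has_conditions else [],
    }


entry = parse_special_casing(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(entry)
```

An entry with an empty condition list applies to every language; the language-dependent entries are the ones that force a casing function to accept a language parameter.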
