Replacing tango.text.Ascii.isearch (page 3)

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by rikki cattermole
in reply to bauss

On 13/10/2022 9:27 PM, bauss wrote:
> This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct.
> 
> Ex. Turkish will not work with it properly.
> 
> Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

Yes, Turkic languages. They require a state machine and quite a few LUTs (lookup tables) to work correctly.

You also need to provide a language and it has to operate on the whole string, not individual characters.

I didn't think it was relevant since Ascii was in the original post ;)
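
To illustrate why a language has to be supplied, here is a minimal sketch. `Lang` and `lowerFor` are hypothetical names, not part of Phobos. The sketch only special-cases the two Turkish capitals I and İ. It ignores the combining dot above (U+0307) and the other context rules, which are what the state machine would be needed for.

```d
import std.array : appender;
import std.uni : toLower;

/// Hypothetical language tag; nothing like this exists in std.uni.
enum Lang { generic, turkish }

/// Lowercases `s`, applying the dotted/dotless i rule when `lang` is Turkish.
/// Assumption: per-character handling is enough for this demo only.
string lowerFor(string s, Lang lang)
{
    auto result = appender!string();
    foreach (dchar c; s)
    {
        dchar lowered;
        if (lang == Lang.turkish && c == 'I')
            lowered = '\u0131';      // I -> dotless ı
        else if (lang == Lang.turkish && c == '\u0130')
            lowered = 'i';           // İ -> dotted i
        else
            lowered = toLower(c);    // simple, language-independent mapping
        result.put(lowered);
    }
    return result.data;
}

void main()
{
    assert(lowerFor("KIZ", Lang.generic) == "kiz");
    assert(lowerFor("KIZ", Lang.turkish) == "k\u0131z");
    assert(lowerFor("\u0130STANBUL", Lang.turkish) == "istanbul");
}
```

The same input lowercases differently depending on the language argument, so a language-independent function cannot produce the right result for both cases.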

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by bauss
in reply to rikki cattermole

On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole wrote:
> On 13/10/2022 9:27 PM, bauss wrote:
>> This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct.
>> 
>> Ex. Turkish will not work with it properly.
>> 
>> Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
>
> Yes, Turkic languages. They require a state machine and quite a few LUTs (lookup tables) to work correctly.
>
> You also need to provide a language and it has to operate on the whole string, not individual characters.
>
> I didn't think it was relevant since Ascii was in the original post ;)

I think it's relevant when it comes to D since D is arguably a Unicode language, not ASCII.

D should strive to be correct, rather than fast.

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by bauss
in reply to bauss

On Thursday, 13 October 2022 at 08:35:50 UTC, bauss wrote:
> On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole wrote:
>> On 13/10/2022 9:27 PM, bauss wrote:
>>> This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct.
>>> 
>>> Ex. Turkish will not work with it properly.
>>> 
>>> Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
>>
>> Yes, Turkic languages. They require a state machine and quite a few LUTs (lookup tables) to work correctly.
>>
>> You also need to provide a language and it has to operate on the whole string, not individual characters.
>>
>> I didn't think it was relevant since Ascii was in the original post ;)
>
> I think it's relevant when it comes to D since D is arguably a Unicode language, not ASCII.
>
> D should strive to be correct, rather than fast.

Oh, and to add to this: if (and only if) you have to do it the hacky way, converting to uppercase rather than lowercase should be preferred, because not all lowercase characters can round-trip. It only affects a small group of characters, and using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.
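
A minimal sketch of the difference, using the two Greek lowercase sigmas as the test case. `isearchLower` and `isearchUpper` are hypothetical helper names. They use the allocating `toLower`/`toUpper` string overloads from std.uni for simplicity, not the lazy `asLowerCase` from the earlier suggestion.

```d
import std.algorithm.searching : canFind;
import std.uni : toLower, toUpper;

/// Case-insensitive search that folds both sides to lowercase.
bool isearchLower(string haystack, string needle)
{
    return haystack.toLower.canFind(needle.toLower);
}

/// The same search, folding both sides to uppercase instead.
bool isearchUpper(string haystack, string needle)
{
    return haystack.toUpper.canFind(needle.toUpper);
}

void main()
{
    // "λόγος" ends in final sigma 'ς'; the needle uses medial sigma 'σ'.
    // Both are already lowercase, so lowercasing leaves them different.
    assert(!isearchLower("λόγος", "γοσ"));
    // Both sigmas uppercase to 'Σ', so the uppercase comparison matches.
    assert(isearchUpper("λόγος", "γοσ"));
}
```

This only shows that uppercase folding merges more variants than lowercase folding for this pair. It is still not language-aware.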

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by rikki cattermole
in reply to bauss

On 13/10/2022 9:42 PM, bauss wrote:
> Oh, and to add to this: if (and only if) you have to do it the hacky way, converting to uppercase rather than lowercase should be preferred, because not all lowercase characters can round-trip. It only affects a small group of characters, and using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.

You will want to repeat this process with normalization to NFKC and to NFD before transforming. Otherwise you may miss some transformations, as the simplified mappings are 1:1 for characters and not everything is representable as a single character.
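
One possible reading of that advice, as a sketch: normalize first, then change case, and compare the resulting keys. `foldKey` is a hypothetical helper. The order (NFKC, then NFD, then uppercase) is an assumption about what is meant here, not an established recipe.

```d
import std.uni : normalize, NFD, NFKC, toUpper;

/// Hypothetical helper: builds a comparison key by normalizing to NFKC,
/// then to NFD, and only then uppercasing.
string foldKey(string s)
{
    return normalize!NFD(normalize!NFKC(s)).toUpper;
}

void main()
{
    // Precomposed é and e + combining acute differ as raw strings...
    assert("\u00E9" != "e\u0301");
    // ...but produce the same key once normalized.
    assert(foldKey("\u00E9") == foldKey("e\u0301"));
    // NFKC also expands compatibility characters such as the ﬁ ligature.
    assert(foldKey("\uFB01") == foldKey("fi"));
}
```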

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by bauss
in reply to rikki cattermole

On Thursday, 13 October 2022 at 08:48:49 UTC, rikki cattermole wrote:
> On 13/10/2022 9:42 PM, bauss wrote:
>> Oh, and to add to this: if (and only if) you have to do it the hacky way, converting to uppercase rather than lowercase should be preferred, because not all lowercase characters can round-trip. It only affects a small group of characters, and using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.
>
> You will want to repeat this process with normalization to NFKC and to NFD before transforming. Otherwise you may miss some transformations, as the simplified mappings are 1:1 for characters and not everything is representable as a single character.

Yeah, text isn't easy :D

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by rikki cattermole
in reply to bauss

On 13/10/2022 9:55 PM, bauss wrote:
> Yeah, text isn't easy :D

Indeed!

It has me a bit concerned, actually. I'm wondering if my string stuff will even work correctly for UIs due to performance issues.

My string builder for instance allocates like crazy just to do slicing. But hey, at least I can feel confident that my general purpose allocator & infrastructure is working correctly!

October 13, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by Patrick Schluter
in reply to bauss

On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
> On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote:
>> On 10/5/22 12:59 PM, torhu wrote:
>>> I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?
>>
>> https://dlang.org/phobos/std_uni.html#asLowerCase
>>
>> bool isearch(S1, S2)(S1 haystack, S2 needle)
>> {
>>     import std.uni;
>>     import std.algorithm;
>>     return haystack.asLowerCase.canFind(needle.asLowerCase);
>> }
>>
>> untested.
>>
>> -Steve
>
> This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct.
>
> Ex. Turkish will not work with it properly.

Greek will also be problematic: it has two different lowercase sigmas but only one uppercase. German may cause issues too: ß normally uppercases to SS (or not), but SS does not lowercase back to ß. But here we have already arrived in Unicode land and the normalization conundrum.
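
A small sketch of the German case, assuming that `asUpperCase` in std.uni expands ß to SS (the output reported elsewhere in this thread suggests it does):

```d
import std.algorithm.comparison : equal;
import std.uni : asLowerCase, asUpperCase;

void main()
{
    // Uppercasing expands ß into two characters.
    assert("straße".asUpperCase.equal("STRASSE"));
    // Lowercasing the result does not bring ß back.
    assert("STRASSE".asLowerCase.equal("strasse"));
    assert(!"STRASSE".asLowerCase.equal("straße"));
}
```

So a lowercase comparison treats "straße" and "STRASSE" as different words, while an uppercase comparison treats them as equal.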

October 25, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by Siarhei Siamashka
in reply to bauss

On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
>> bool isearch(S1, S2)(S1 haystack, S2 needle)
>> {
>>     import std.uni;
>>     import std.algorithm;
>>     return haystack.asLowerCase.canFind(needle.asLowerCase);
>> }
>>
>> untested.
>>
>> -Steve
>
> This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct.
>
> Ex. Turkish will not work with it properly.
>
> Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

Wow, I didn't expect anything like this and just thought that the nightmares of handling 8-bit codepages for non-English languages had ceased to exist nowadays. Too bad. What are the best practices to deal with Turkish text in D?

For example, Ukrainian letters 'і' and 'І' don't share the same codes with Latin 'i' and 'I' and this is working fine. Except for a possible phishing opportunity. Why haven't the standards committees done the same for Turkish 'I' yet?

As for the German letter 'ß', Wikipedia says that the uppercase variant 'ẞ' has existed since 2008 (ISO 10646). Do German people use it now?

import std;
void main() {
  "ß".asUpperCase.writeln;             // prints "SS"
  "ẞ".asLowerCase.writeln;             // prints "ß"
  "ẞ".asLowerCase.asUpperCase.writeln; // prints "SS"
}

October 25, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by rikki cattermole
in reply to Siarhei Siamashka

On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
> Wow, I didn't expect anything like this and just thought that the nightmares of handling 8-bit codepages for non-English languages had ceased to exist nowadays. Too bad. What are the best practices to deal with Turkish text in D?

std.uni doesn't support it.

For casing it only supports the simple mappings which are 1:1 and not language dependent.

I haven't got to it yet for my own string handling library, so I can't point you to that (even if it was not ready).

I'm sure somebody has got it, but unfortunately you may end up wanting to use ICU.

October 26, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by Siarhei Siamashka
in reply to rikki cattermole

On Tuesday, 25 October 2022 at 06:32:00 UTC, rikki cattermole wrote:
> On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
>> What are the best practices to deal with Turkish text in D?
>
> std.uni doesn't support it.

OK, I'm not specifically interested in this personally and I even would be happy to remain blissfully ignorant. Just wondered whether a preferred solution already exists, considering that this forum has a Turkish section.

Should we ignore the "D should strive to be correct, rather than fast" comment from bauss for now? Or can some action be taken to improve the current situation?
