Thread overview
D'ish similar_text?
May 10, 2018
Dgame
May 10, 2018
Vladimir Panteleev
May 10, 2018
Dgame
May 10, 2018
Vladimir Panteleev
May 10, 2018
Dgame
May 10, 2018
I'm in need for some sort of string similarity comparision. I've found soundex but that didn't solved my needs. After some search I found a C implementation of similar_text, but that is quite ugly... I was able to let it work in D but it's still somewhat messy. Is there any D implementation of similar_text? Here's my current code which makes heavy use of pointer-arithmetic which makes it hard to understand.

----
void similar_text_similar_str(char* txt1, size_t len1, char* txt2, size_t len2, size_t* pos1, size_t* pos2, size_t* max) {
    char* p, q;
    char* end1 = txt1 + len1;
    char* end2 = txt2 + len2;
    size_t l;

    *max = 0;
    for (p = txt1; p < end1; p++) {
        for (q = txt2; q < end2; q++) {
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++) { }
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}

size_t similar_text_similar_char(char* txt1, size_t len1, char* txt2, size_t len2) {
    size_t sum;
    size_t pos1, pos2, max;

    similar_text_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    if ((sum = max) != 0) {
        if (pos1 && pos2)  {
            sum += similar_text_similar_char(txt1, pos1, txt2, pos2);
        }
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += similar_text_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                    txt2 + pos2 + max, len2 - pos2 - max);
        }
    }

    return sum;
}

size_t similar_text(in string s1, in string s2, out double percent) {
    immutable size_t sim = similar_text_similar_char(s1.dup.ptr, s1.length, s2.dup.ptr, s2.length);
    percent = sim * 200.0 / (s1.length + s2.length);

    return sim;
}
----
May 10, 2018
On Thursday, 10 May 2018 at 20:08:04 UTC, Dgame wrote:
> void similar_text_similar_str(char* txt1, size_t len1, char*

That looks like an implementation of Levenshtein distance. We have one in Phobos:

https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

May 10, 2018
On Thursday, 10 May 2018 at 20:13:49 UTC, Vladimir Panteleev wrote:
> On Thursday, 10 May 2018 at 20:08:04 UTC, Dgame wrote:
>> void similar_text_similar_str(char* txt1, size_t len1, char*
>
> That looks like an implementation of Levenshtein distance. We have one in Phobos:
>
> https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

Oh, that could to work, thank you. I've just tested it quickly, but that code seems to produce the same results:

----
size_t similar_text2(in string s1, in string s2, out double percent) {
    import std.algorithm: levenshteinDistance;

    immutable size_t distance = s1.levenshteinDistance(s2);
    immutable size_t len = s1.length + s2.length;
    percent = (len - distance) * 100.0 / len;

    return distance;
}
----
May 10, 2018
On Thursday, 10 May 2018 at 20:32:11 UTC, Dgame wrote:
>     immutable size_t len = s1.length + s2.length;
>     percent = (len - distance) * 100.0 / len;

Note that this formula will give you only 50% similarity for "abc" and "def", i.e. two completely different strings. I suggest to divide by max(s1.length, s2.length) instead.
May 10, 2018
On Thursday, 10 May 2018 at 20:38:12 UTC, Vladimir Panteleev wrote:
> On Thursday, 10 May 2018 at 20:32:11 UTC, Dgame wrote:
>>     immutable size_t len = s1.length + s2.length;
>>     percent = (len - distance) * 100.0 / len;
>
> Note that this formula will give you only 50% similarity for "abc" and "def", i.e. two completely different strings. I suggest to divide by max(s1.length, s2.length) instead.

Hm, that does not work either. ABC and AZB have a different outcome with both. How can I calculate the percentage with levenshtein? It's rather simple with similar_text since it returns the amount of similar chars.