D'ish similar_text? - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » D'ish similar_text?

Thread overview

D'ish similar_text?
May 10, 2018 Dgame
May 10, 2018 Vladimir Panteleev
May 10, 2018 Dgame
May 10, 2018 Vladimir Panteleev
May 10, 2018 Dgame

May 10, 2018

D'ish similar_text?

Posted by Dgame

Dgame

I'm in need for some sort of string similarity comparision. I've found soundex but that didn't solved my needs. After some search I found a C implementation of similar_text, but that is quite ugly... I was able to let it work in D but it's still somewhat messy. Is there any D implementation of similar_text? Here's my current code which makes heavy use of pointer-arithmetic which makes it hard to understand.

----
void similar_text_similar_str(char* txt1, size_t len1, char* txt2, size_t len2, size_t* pos1, size_t* pos2, size_t* max) {
    char* p, q;
    char* end1 = txt1 + len1;
    char* end2 = txt2 + len2;
    size_t l;

    *max = 0;
    for (p = txt1; p < end1; p++) {
        for (q = txt2; q < end2; q++) {
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++) { }
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}

size_t similar_text_similar_char(char* txt1, size_t len1, char* txt2, size_t len2) {
    size_t sum;
    size_t pos1, pos2, max;

    similar_text_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    if ((sum = max) != 0) {
        if (pos1 && pos2)  {
            sum += similar_text_similar_char(txt1, pos1, txt2, pos2);
        }
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += similar_text_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                    txt2 + pos2 + max, len2 - pos2 - max);
        }
    }

    return sum;
}

size_t similar_text(in string s1, in string s2, out double percent) {
    immutable size_t sim = similar_text_similar_char(s1.dup.ptr, s1.length, s2.dup.ptr, s2.length);
    percent = sim * 200.0 / (s1.length + s2.length);

    return sim;
}
----

May 10, 2018

Re: D'ish similar_text?

Posted by Vladimir Panteleev
in reply to Dgame

Vladimir Panteleev

Posted in reply to Dgame

On Thursday, 10 May 2018 at 20:08:04 UTC, Dgame wrote:
> void similar_text_similar_str(char* txt1, size_t len1, char*

That looks like an implementation of Levenshtein distance. We have one in Phobos:

https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

May 10, 2018

Re: D'ish similar_text?

Posted by Dgame
in reply to Vladimir Panteleev

Dgame

Posted in reply to Vladimir Panteleev

On Thursday, 10 May 2018 at 20:13:49 UTC, Vladimir Panteleev wrote:
> On Thursday, 10 May 2018 at 20:08:04 UTC, Dgame wrote:
>> void similar_text_similar_str(char* txt1, size_t len1, char*
>
> That looks like an implementation of Levenshtein distance. We have one in Phobos:
>
> https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

Oh, that could to work, thank you. I've just tested it quickly, but that code seems to produce the same results:

----
size_t similar_text2(in string s1, in string s2, out double percent) {
    import std.algorithm: levenshteinDistance;

    immutable size_t distance = s1.levenshteinDistance(s2);
    immutable size_t len = s1.length + s2.length;
    percent = (len - distance) * 100.0 / len;

    return distance;
}
----

May 10, 2018

Re: D'ish similar_text?

Posted by Vladimir Panteleev
in reply to Dgame

Vladimir Panteleev

Posted in reply to Dgame

On Thursday, 10 May 2018 at 20:32:11 UTC, Dgame wrote:
>     immutable size_t len = s1.length + s2.length;
>     percent = (len - distance) * 100.0 / len;

Note that this formula will give you only 50% similarity for "abc" and "def", i.e. two completely different strings. I suggest to divide by max(s1.length, s2.length) instead.

May 10, 2018

Re: D'ish similar_text?

Posted by Dgame
in reply to Vladimir Panteleev

Dgame

Posted in reply to Vladimir Panteleev

On Thursday, 10 May 2018 at 20:38:12 UTC, Vladimir Panteleev wrote:
> On Thursday, 10 May 2018 at 20:32:11 UTC, Dgame wrote:
>>     immutable size_t len = s1.length + s2.length;
>>     percent = (len - distance) * 100.0 / len;
>
> Note that this formula will give you only 50% similarity for "abc" and "def", i.e. two completely different strings. I suggest to divide by max(s1.length, s2.length) instead.

Hm, that does not work either. ABC and AZB have a different outcome with both. How can I calculate the percentage with levenshtein? It's rather simple with similar_text since it returns the amount of similar chars.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation