Thread overview | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
May 22, 2014 Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
I justd discovered that the std.algorithm implementation of Levenshtein Distance requires O(m*n) memory usage. This is not neccessary. I have a C++-implementation of Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interested in discussion modifying std.algorithm to make use of this? Here's C++ implementation: template<class T> inline void perm3_231(T& a, T& b, T& c) { T _t=a; a=b; b=c; c=_t; } template<class T> inline pure bool is_min(const T& a) { return a == pnw::minof<T>(); } template<class T> inline pure bool is_max(const T& a) { return a == pnw::maxof<T>(); } template<class T, class D = size_t> inline pure D damerau_levenshtein(const T& s_, const T& t_, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5), D transposition_weight = static_cast<D>(3)) { // reorder s and t to minimize memory usage bool ook = s_.size() >= t_.size(); // argument ordering ok flag const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest const D m = s.size(); const D n = t.size(); if (m == 0) { return n; } if (n == 0) { return m; } // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn), // since it only requires that the previous row/column and current row/column be stored at // any one time. #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and second-previous column on stack #elif HAVE_CXX11_UNIQUE_PTR std::unique_ptr<D[]> cc_(new D[n+1]); // current column std::unique_ptr<D[]> pc_(new D[n+1]); // previous column std::unique_ptr<D[]> sc_(new D[n+1]); // second previous column #else auto cc_ = new D[n+1]; // current column auto pc_ = new D[n+1]; // previous column auto sc_ = new D[n+1]; // second previous column //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous and second previous column #endif D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for efficient swapping // initialize previous column for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; } // second previous column \c sc will be defined in second \c i iteration in outer-loop const auto D_max = std::numeric_limits<D>::max(); // Computing the Levenshtein distance is based on the observation that if we // reserve a matrix to hold the Levenshtein distances between all prefixes // of the first string and all prefixes of the second, then we can compute // the values in the matrix by flood filling the matrix, and thus find the // distance between the two full strings as the last value computed. // This algorithm, an example of bottom-up dynamic programming, is // discussed, with variants, in the 1974 article The String-to-string // correction problem by Robert A. Wagner and Michael J.Fischer. for (D i = 0; i < m; ++i) { cc[0] = i+insert_weight; auto tmin = D_max; // row/column total min for (D j = 0; j < n; ++j) { // TODO Use sub_dist //auto sub_dist = damerau_levenshtein(s[i], t[j]); // recurse if for example T is an std::vector<std::string> cc[j+1] = pnw::min(pc[j+1] + insert_weight, // insertion cc[j] + delete_weight, // deletion pc[j] + (s[i] == t[j] ? 0 : replace_weight)); // substitution // transposition if (not is_max(transposition_weight)) { // if transposition should be allowed if (i > 0 and j > 0 and // we need at least two characters s[i-1] == t[j] and // and first must be equal second s[i] == t[j-1] // and vice versa ) { cc[j+1] = std::min(cc[j+1], sc[j-1] + transposition_weight); } } if (not is_max(max_distance)) { tmin = std::min(tmin, cc[j+1]); } } if ((not is_max(max_distance)) and tmin >= max_distance) { // if no element is smaller than \p max_distance return max_distance; } if (transposition_weight) { perm3_231(pc, cc, sc); // rotate pointers } else { std::swap(cc, pc); } } #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || defined(HAVE_CXX11_UNIQUE_PTR)) delete [] cc_; delete [] pc_; delete [] sc_; #endif return pc[n]; } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em sequences \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class T, class D = size_t> inline pure D levenshtein(const T& s, const T& t, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5)) { return damerau_levenshtein(s, t, max_distance, insert_weight, delete_weight, replace_weight, std::numeric_limits<D>::max()); } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em arrays \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class D = size_t> inline pure D levenshtein(const char * s, const char * t, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5)) { return levenshtein(csc(s), csc(t), max_distance, insert_weight, delete_weight, replace_weight); } /* ---------------------------- Group Separator ---------------------------- */ template<class T, class D = size_t> inline pure D test_levenshtein_symmetry(const T& s, const T& t, D max_distance = std::numeric_limits<D>::max()) { D st = levenshtein(s, t, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1)); D ts = levenshtein(t, s, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1)); bool sym = (st == ts); // symmetry return sym ? st : std::numeric_limits<D>::max(); } |
May 22, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nordlöw | You should make a pull request with this implementation adapted to std.algorithm interface
On 2014-05-22 09:49:09 +0000, Nordlöw said:
> I justd discovered that the std.algorithm implementation of Levenshtein Distance requires O(m*n) memory usage.
>
> This is not neccessary. I have a C++-implementation of Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interested in discussion modifying std.algorithm to make use of this?
>
> Here's C++ implementation:
>
>
> template<class T> inline void perm3_231(T& a, T& b, T& c) {
> T _t=a; a=b; b=c; c=_t;
> }
> template<class T> inline pure bool is_min(const T& a) {
> return a == pnw::minof<T>();
> }
> template<class T> inline pure bool is_max(const T& a) {
> return a == pnw::maxof<T>();
> }
>
> template<class T, class D = size_t>
> inline pure
> D damerau_levenshtein(const T& s_, const T& t_,
> D max_distance = std::numeric_limits<D>::max(),
> D insert_weight = static_cast<D>(10),
> D delete_weight = static_cast<D>(7),
> D replace_weight = static_cast<D>(5),
> D transposition_weight = static_cast<D>(3))
> {
> // reorder s and t to minimize memory usage
> bool ook = s_.size() >= t_.size(); // argument ordering ok flag
> const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest
> const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest
>
> const D m = s.size();
> const D n = t.size();
>
> if (m == 0) { return n; }
> if (n == 0) { return m; }
>
> // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn),
> // since it only requires that the previous row/column and current row/column be stored at
> // any one time.
> #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
> D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and second-previous column on stack
> #elif HAVE_CXX11_UNIQUE_PTR
> std::unique_ptr<D[]> cc_(new D[n+1]); // current column
> std::unique_ptr<D[]> pc_(new D[n+1]); // previous column
> std::unique_ptr<D[]> sc_(new D[n+1]); // second previous column
> #else
> auto cc_ = new D[n+1]; // current column
> auto pc_ = new D[n+1]; // previous column
> auto sc_ = new D[n+1]; // second previous column
> //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous and second previous column
> #endif
> D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for efficient swapping
>
> // initialize previous column
> for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; }
>
> // second previous column \c sc will be defined in second \c i iteration in outer-loop
>
> const auto D_max = std::numeric_limits<D>::max();
>
> // Computing the Levenshtein distance is based on the observation that if we
> // reserve a matrix to hold the Levenshtein distances between all prefixes
> // of the first string and all prefixes of the second, then we can compute
> // the values in the matrix by flood filling the matrix, and thus find the
> // distance between the two full strings as the last value computed.
> // This algorithm, an example of bottom-up dynamic programming, is
> // discussed, with variants, in the 1974 article The String-to-string
> // correction problem by Robert A. Wagner and Michael J.Fischer.
> for (D i = 0; i < m; ++i) {
> cc[0] = i+insert_weight;
> auto tmin = D_max; // row/column total min
> for (D j = 0; j < n; ++j) {
> // TODO Use sub_dist
> //auto sub_dist = damerau_levenshtein(s[i], t[j]); // recurse if for example T is an std::vector<std::string>
> cc[j+1] = pnw::min(pc[j+1] + insert_weight, // insertion
> cc[j] + delete_weight, // deletion
> pc[j] + (s[i] == t[j] ? 0 : replace_weight)); // substitution
>
> // transposition
> if (not is_max(transposition_weight)) { // if transposition should be allowed
> if (i > 0 and j > 0 and // we need at least two characters
> s[i-1] == t[j] and // and first must be equal second
> s[i] == t[j-1] // and vice versa
> ) {
> cc[j+1] = std::min(cc[j+1],
> sc[j-1] + transposition_weight);
> }
> }
>
> if (not is_max(max_distance)) {
> tmin = std::min(tmin, cc[j+1]);
> }
> }
>
> if ((not is_max(max_distance)) and
> tmin >= max_distance) {
> // if no element is smaller than \p max_distance
> return max_distance;
> }
>
> if (transposition_weight) {
> perm3_231(pc, cc, sc); // rotate pointers
> } else {
> std::swap(cc, pc);
> }
> }
>
> #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || defined(HAVE_CXX11_UNIQUE_PTR))
> delete [] cc_;
> delete [] pc_;
> delete [] sc_;
> #endif
> return pc[n];
> }
>
>
> /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em sequences \p s and \p t.
> * Computing LD is also called Optimal String Alignment (OSA).
> */
> template<class T, class D = size_t>
> inline pure
> D levenshtein(const T& s, const T& t,
> D max_distance = std::numeric_limits<D>::max(),
> D insert_weight = static_cast<D>(10),
> D delete_weight = static_cast<D>(7),
> D replace_weight = static_cast<D>(5))
> {
> return damerau_levenshtein(s, t, max_distance, insert_weight, delete_weight, replace_weight,
> std::numeric_limits<D>::max());
> }
>
> /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em arrays \p s and \p t.
> * Computing LD is also called Optimal String Alignment (OSA).
> */
> template<class D = size_t>
> inline pure
> D levenshtein(const char * s, const char * t,
> D max_distance = std::numeric_limits<D>::max(),
> D insert_weight = static_cast<D>(10),
> D delete_weight = static_cast<D>(7),
> D replace_weight = static_cast<D>(5))
> {
> return levenshtein(csc(s),
> csc(t),
> max_distance, insert_weight, delete_weight, replace_weight);
> }
>
> /* ---------------------------- Group Separator ---------------------------- */
>
> template<class T, class D = size_t>
> inline pure
> D test_levenshtein_symmetry(const T& s, const T& t,
> D max_distance = std::numeric_limits<D>::max())
> {
> D st = levenshtein(s, t, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
> D ts = levenshtein(t, s, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
> bool sym = (st == ts); // symmetry
> return sym ? st : std::numeric_limits<D>::max();
> }
|
May 22, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Max Klyga Attachments:
| I've begun work on this and my implementation in D passes all the std.algorithm unit tests, but because it now uses a single array instead of a matrix, path() no longer provides the correct answer.
I'm working on trying to amend it so that there is consistency.
On Thu, May 22, 2014 at 12:26 PM, Max Klyga via Digitalmars-d < digitalmars-d@puremagic.com> wrote:
> You should make a pull request with this implementation adapted to std.algorithm interface
>
>
> On 2014-05-22 09:49:09 +0000, Nordlöw said:
>
> I justd discovered that the std.algorithm implementation of Levenshtein
>> Distance requires O(m*n) memory usage.
>>
>> This is not neccessary. I have a C++-implementation of
>> Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interested
>> in discussion modifying std.algorithm to make use of this?
>>
>> Here's C++ implementation:
>>
>>
>> template<class T> inline void perm3_231(T& a, T& b, T& c) {
>> T _t=a; a=b; b=c; c=_t;
>> }
>> template<class T> inline pure bool is_min(const T& a) {
>> return a == pnw::minof<T>();
>> }
>> template<class T> inline pure bool is_max(const T& a) {
>> return a == pnw::maxof<T>();
>> }
>>
>> template<class T, class D = size_t>
>> inline pure
>> D damerau_levenshtein(const T& s_, const T& t_,
>> D max_distance = std::numeric_limits<D>::max(),
>> D insert_weight = static_cast<D>(10),
>> D delete_weight = static_cast<D>(7),
>> D replace_weight = static_cast<D>(5),
>> D transposition_weight = static_cast<D>(3))
>> {
>> // reorder s and t to minimize memory usage
>> bool ook = s_.size() >= t_.size(); // argument ordering ok flag
>> const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest
>> const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest
>>
>> const D m = s.size();
>> const D n = t.size();
>>
>> if (m == 0) { return n; }
>> if (n == 0) { return m; }
>>
>> // Adapt the algorithm to use less space, O(3*min(n,m)) instead of
>> O(mn),
>> // since it only requires that the previous row/column and current
>> row/column be stored at
>> // any one time.
>> #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
>> D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and
>> second-previous column on stack
>> #elif HAVE_CXX11_UNIQUE_PTR
>> std::unique_ptr<D[]> cc_(new D[n+1]); // current column
>> std::unique_ptr<D[]> pc_(new D[n+1]); // previous column
>> std::unique_ptr<D[]> sc_(new D[n+1]); // second previous column
>> #else
>> auto cc_ = new D[n+1]; // current column
>> auto pc_ = new D[n+1]; // previous column
>> auto sc_ = new D[n+1]; // second previous column
>> //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous
>> and second previous column
>> #endif
>> D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for
>> efficient swapping
>>
>> // initialize previous column
>> for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; }
>>
>> // second previous column \c sc will be defined in second \c i
>> iteration in outer-loop
>>
>> const auto D_max = std::numeric_limits<D>::max();
>>
>> // Computing the Levenshtein distance is based on the observation
>> that if we
>> // reserve a matrix to hold the Levenshtein distances between all
>> prefixes
>> // of the first string and all prefixes of the second, then we can
>> compute
>> // the values in the matrix by flood filling the matrix, and thus
>> find the
>> // distance between the two full strings as the last value computed.
>> // This algorithm, an example of bottom-up dynamic programming, is
>> // discussed, with variants, in the 1974 article The String-to-string
>> // correction problem by Robert A. Wagner and Michael J.Fischer.
>> for (D i = 0; i < m; ++i) {
>> cc[0] = i+insert_weight;
>> auto tmin = D_max; // row/column total min
>> for (D j = 0; j < n; ++j) {
>> // TODO Use sub_dist
>> //auto sub_dist = damerau_levenshtein(s[i], t[j]); //
>> recurse if for example T is an std::vector<std::string>
>> cc[j+1] = pnw::min(pc[j+1] + insert_weight, //
>> insertion
>> cc[j] + delete_weight, //
>> deletion
>> pc[j] + (s[i] == t[j] ? 0 :
>> replace_weight)); // substitution
>>
>> // transposition
>> if (not is_max(transposition_weight)) { // if transposition
>> should be allowed
>> if (i > 0 and j > 0 and // we need at least two
>> characters
>> s[i-1] == t[j] and // and first must be equal second
>> s[i] == t[j-1] // and vice versa
>> ) {
>> cc[j+1] = std::min(cc[j+1],
>> sc[j-1] + transposition_weight);
>> }
>> }
>>
>> if (not is_max(max_distance)) {
>> tmin = std::min(tmin, cc[j+1]);
>> }
>> }
>>
>> if ((not is_max(max_distance)) and
>> tmin >= max_distance) {
>> // if no element is smaller than \p max_distance
>> return max_distance;
>> }
>>
>> if (transposition_weight) {
>> perm3_231(pc, cc, sc); // rotate pointers
>> } else {
>> std::swap(cc, pc);
>> }
>> }
>>
>> #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) ||
>> defined(HAVE_CXX11_UNIQUE_PTR))
>> delete [] cc_;
>> delete [] pc_;
>> delete [] sc_;
>> #endif
>> return pc[n];
>> }
>>
>>
>> /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em
>> sequences \p s and \p t.
>> * Computing LD is also called Optimal String Alignment (OSA).
>> */
>> template<class T, class D = size_t>
>> inline pure
>> D levenshtein(const T& s, const T& t,
>> D max_distance = std::numeric_limits<D>::max(),
>> D insert_weight = static_cast<D>(10),
>> D delete_weight = static_cast<D>(7),
>> D replace_weight = static_cast<D>(5))
>> {
>> return damerau_levenshtein(s, t, max_distance, insert_weight,
>> delete_weight, replace_weight,
>> std::numeric_limits<D>::max());
>> }
>>
>> /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em
>> arrays \p s and \p t.
>> * Computing LD is also called Optimal String Alignment (OSA).
>> */
>> template<class D = size_t>
>> inline pure
>> D levenshtein(const char * s, const char * t,
>> D max_distance = std::numeric_limits<D>::max(),
>> D insert_weight = static_cast<D>(10),
>> D delete_weight = static_cast<D>(7),
>> D replace_weight = static_cast<D>(5))
>> {
>> return levenshtein(csc(s),
>> csc(t),
>> max_distance, insert_weight, delete_weight,
>> replace_weight);
>> }
>>
>> /* ---------------------------- Group Separator
>> ---------------------------- */
>>
>> template<class T, class D = size_t>
>> inline pure
>> D test_levenshtein_symmetry(const T& s, const T& t,
>> D max_distance =
>> std::numeric_limits<D>::max())
>> {
>> D st = levenshtein(s, t, max_distance, static_cast<D>(1),static_cast<
>> D>(1),static_cast<D>(1));
>> D ts = levenshtein(t, s, max_distance, static_cast<D>(1),static_cast<
>> D>(1),static_cast<D>(1));
>> bool sym = (st == ts); // symmetry
>> return sym ? st : std::numeric_limits<D>::max();
>> }
>>
>
>
>
|
May 22, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Max Klyga | On 5/22/14, 4:26 AM, Max Klyga wrote:
> You should make a pull request with this implementation adapted to
> std.algorithm interface
Yes please. I recall I wanted to add that optimization later but forgot about it.
Andrei
|
May 22, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to James Carr | > I've begun work on this and my implementation in D passes all I uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per |
May 22, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nordlöw Attachments:
| I've submitted a pull request. https://github.com/D-Programming-Language/phobos/pull/2195 On Thu, May 22, 2014 at 8:05 PM, "Nordlöw" <digitalmars-d@puremagic.com>wrote: > I've begun work on this and my implementation in D passes all >> > > I uploaded it here > > https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp > > I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) > > /Per > |
May 25, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Attachments:
| Hi again, I've implemented the memory efficient version https://github.com/D-Programming-Language/phobos/pull/2195 I was wondering if I could get some feedback please. Thanks. On Thu, May 22, 2014 at 8:42 PM, James Carr <jdcarrman@gmail.com> wrote: > I've submitted a pull request. > > https://github.com/D-Programming-Language/phobos/pull/2195 > > > On Thu, May 22, 2014 at 8:05 PM, "Nordlöw" <digitalmars-d@puremagic.com>wrote: > >> I've begun work on this and my implementation in D passes all >>> >> >> I uploaded it here >> >> https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp >> >> I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) >> >> /Per >> > > |
July 12, 2014 Re: Unnecessary Large Memory Usage of Levenshtein Distance in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to James Carr | On Sunday, 25 May 2014 at 22:22:28 UTC, James Carr via Digitalmars-d wrote:
> Hi again,
>
> I've implemented the memory efficient version
> https://github.com/D-Programming-Language/phobos/pull/2195
>
> I was wondering if I could get some feedback please.
>
> Thanks.
This PR has been sitting around for over a month without anybody doing anything about it. Can we get a Phobos dev to take a look?
|
Copyright © 1999-2021 by the D Language Foundation