October 01, 2014 Find Semantically Correct Word Splits in UTF-8 Strings
I'm looking for a way to make my algorithm

    S[] findWordSplit(S)(S word, HLang[] langs = [])
    {
        for (size_t i = 1; i + 1 < word.length; i++)
        {
            const first = word[0..i];
            const second = word[i..$];
            if (this.canMeanSomething(first, langs) &&
                this.canMeanSomething(second, langs))
            {
                return [first, second];
            }
        }
        return typeof(return).init;
    }

work correctly when S is a (UTF-8) string without first encoding word to a dstring, that is, in a lazy manner.

Currently this algorithm works as

    "carwash" => ["car", "wash"]

and I would like it to work correctly and efficiently in my native language as well:

    "biltvätt" => ["bil", "tvätt"]

:)
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
> I'm looking for a way to make my algorithm
>
Update:
S[] findMeaningfulWordSplit(S)(S word,
HLang[] langs = []) if (isSomeString!S)
{
for (size_t i = 1; i + 1 < word.length; i++)
{
const first = word.takeExactly(i).to!string;
const second = word.dropExactly(i).to!string;
if (this.canMeanSomething(first, langs) &&
this.canMeanSomething(second, langs))
{
return [first,
second];
}
}
return typeof(return).init;
}
works, but unfortunately has O(word.length^^2) time-complexity.
Do we need a new InputRange algorithm for this?
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
> Do we need a new InputRange algorithm for this?
If so, how about naming it SlidingSplitter or perhaps SlidingHalver in the binary case?
I haven't thought about making this variadic in the number of splits (>= 1), but that could be useful in my application as well.
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

    S[] findMeaningfulWordSplit(S)(S word,
                                   HLang[] langs = []) if (isSomeString!S)
    {
        S second = word;
        for (size_t i = 1; i + 1 < word.length; i++)
        {
            second = word.dropExactly(i).to!string;
            const first = word[0 .. $ - second.length];
            if (this.canMeanSomething(first, langs) &&
                this.canMeanSomething(second, langs))
            {
                return [first, second];
            }
        }
        return typeof(return).init;
    }
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
> On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
>> I'm looking for a way to make my algorithm
>>
>
> Update:
>
>     S[] findMeaningfulWordSplit(S)(S word,
>                                    HLang[] langs = []) if (isSomeString!S)
>     {
>         for (size_t i = 1; i + 1 < word.length; i++)
>         {
>             const first = word.takeExactly(i).to!string;
>             const second = word.dropExactly(i).to!string;
>             if (this.canMeanSomething(first, langs) &&
>                 this.canMeanSomething(second, langs))
>             {
>                 return [first, second];
>             }
>         }
>         return typeof(return).init;
>     }
>
> works, but unfortunately has O(word.length^^2) time-complexity.
>
> Do we need a new InputRange algorithm for this?

This seems like a text-book case for a trie algorithm:
http://en.wikipedia.org/wiki/Trie

Once your "dictionary" is built, it can find *all* the splits in O(word.length).

...but I don't know how well it adapts to the range interface. Should still work though.

Also, Aho–Corasick might be relevant depending on what you want to do, for example, to find all substrings that happen to be a word, regardless of what comes before or after:
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
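As a sketch of that trie idea in D (the Node layout and the helper names below are illustrative assumptions, not code from this thread): index the dictionary by UTF-8 code unit, then one walk over the word finds every prefix that is a word, and each candidate split costs one extra suffix lookup. Since isWord is only ever set at the end of an inserted word, every reported slice boundary is automatically a valid UTF-8 boundary.

    struct Node
    {
        Node*[char] next; // edges keyed by UTF-8 code unit
        bool isWord;      // a dictionary word ends exactly here
    }

    void insert(Node* root, string word)
    {
        auto cur = root;
        foreach (char c; word) // iterates code units; no decoding needed
        {
            if (c !in cur.next)
                cur.next[c] = new Node;
            cur = cur.next[c];
        }
        cur.isWord = true;
    }

    bool contains(Node* root, string word)
    {
        auto cur = root;
        foreach (char c; word)
        {
            auto p = c in cur.next;
            if (p is null) return false;
            cur = *p;
        }
        return cur.isWord;
    }

    // One walk finds every prefix that is a word; each matching
    // prefix costs one extra contains() lookup for the suffix.
    string[2][] wordSplits(Node* root, string word)
    {
        string[2][] result;
        auto cur = root;
        foreach (i, char c; word)
        {
            if (i > 0 && cur.isWord && contains(root, word[i .. $]))
                result ~= [word[0 .. i], word[i .. $]];
            auto p = c in cur.next;
            if (p is null) break;
            cur = *p;
        }
        return result;
    }

With "bil", "tvätt" and "biltvätt" inserted, wordSplits(root, "biltvätt") yields [["bil", "tvätt"]]; the full word itself is not reported because only proper two-part splits are collected.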
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
> I'm looking for a way to make my algorithm
>
> S[] findWordSplit(S)(S word,
> HLang[] langs = [])
> {
> for (size_t i = 1; i + 1 < word.length; i++)
> {
> const first = word[0..i];
> const second = word[i..$];
> if (this.canMeanSomething(first, langs) &&
> this.canMeanSomething(second, langs))
> {
> return [first,
> second];
> }
> }
> return typeof(return).init;
> }
>
> work correctly when S is a (UTF-8) string without first encoding word to a dstring, that is, in a lazy manner.
>
> Currently this algorithm works as
>
> "carwash" => ["car", "wash"]
>
> and I would like it to work correctly and efficiently in my native language as well:
>
> "biltvätt" => ["bil", "tvätt"]
>
> :)
Out of curiosity, why exactly isn't it working in your "native language"? If you avoid decoding in your "canMeanSomething", you should encounter no problems.
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to monarch_dodra

On Wednesday, 1 October 2014 at 16:44:24 UTC, monarch_dodra wrote:
> language"? If you avoid decoding in your "canMeanSomething", you should encounter no problems.
You're right. I'll try that.
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
> On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
>> I'm looking for a way to make my algorithm
>>
>
> Update:
>
> S[] findMeaningfulWordSplit(S)(S word,
> HLang[] langs = []) if (isSomeString!S)
> {
> for (size_t i = 1; i + 1 < word.length; i++)
> {
> const first = word.takeExactly(i).to!string;
Does that even work? takeExactly would pop up to N *codepoints*, whereas your string only has N *codeunits*.
Something like:
for (auto second = str ; !second.empty ; second.popFront() )
{
auto first = str[0 .. $ - second.length];
...
}
//special case str + str[$ .. $] here. (or adapt your loop)
Would also be unicode correct, without increasing the original complexity.
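Spelled out as a full function (a sketch only; HLang, canMeanSomething and the enclosing aggregate are assumed from the earlier posts), the suggested loop could look like:

    S[] findMeaningfulWordSplit(S)(S word, HLang[] langs = [])
        if (isSomeString!S)
    {
        import std.range : empty, popFront;
        auto second = word;
        second.popFront(); // keep the first part non-empty
        for (; !second.empty; second.popFront())
        {
            // popFront on a narrow string pops a whole code point,
            // so this slice boundary is always a valid UTF-8 boundary.
            auto first = word[0 .. $ - second.length];
            if (this.canMeanSomething(first, langs) &&
                this.canMeanSomething(second, langs))
                return [first, second];
        }
        return typeof(return).init;
    }

Both halves stay slices of the original string, so no per-iteration allocation or conversion to!string is needed.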
October 01, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to monarch_dodra

On Wednesday, 1 October 2014 at 17:09:57 UTC, monarch_dodra wrote:
> Does that even work? takeExactly would pop up to N *codepoints*, whereas your string only has N *codeunits*.
You're right again :)

I forgot that takeExactly auto-decodes.
October 02, 2014 Re: Find Semantically Correct Word Splits in UTF-8 Strings
Posted in reply to Nordlöw

On Wednesday, 1 October 2014 at 21:34:40 UTC, Nordlöw wrote:
> On Wednesday, 1 October 2014 at 17:09:57 UTC, monarch_dodra wrote:
>> Does that even work? takeExactly would pop up to N *codepoints*, whereas your string only has N *codeunits*.
>
> You're right again :)
>
> I forgot that takeExactly auto-decodes.
Technically, it only pops. It's front/popFront that auto-decode.
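A minimal illustration of that distinction (walkLength does the decoding pops):

    import std.range : takeExactly, walkLength;

    void main()
    {
        string s = "tvätt";    // 5 code points, but 'ä' is 2 bytes in UTF-8
        assert(s.length == 6); // .length counts code units
        assert(s.takeExactly(5).walkLength == 5); // range primitives count code points
    }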
Copyright © 1999-2021 by the D Language Foundation