May 07, 2010
On Fri, 7 May 2010, Lionello Lunesu wrote:

> On 6-5-2010 22:37, Michel Fortin wrote:
> > On 2010-05-05 23:45:50 -0400, Walter Bright <newshound1@digitalmars.com> said:
> > 
> >> Walter Bright wrote:
> >>> Alex Makhotin wrote:
> >>>> It takes ~40 seconds 50% load on the dual core processor(CentOS 5.3 kernel 2.6.32.4), to get the actual error messages about the undefined identifier.
> >>>
> >>> Definitely there's a problem.
> >>
> >> The problem is the spell checker is O(n*n) on the number of characters
> >> in the undefined identifier.
> > 
> > That's an algorithm that can't scale then.
> > 
> > Checking the Levenshtein distance for each known identifier within a small difference in length would be a better idea. (Clang is said to use the Levenshtein distance, it probably does something of the sort.)
> > 
> > http://en.wikipedia.org/wiki/Levenshtein_distance
> > 
> and especially this line:
> 
> # If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.

The source for this is pretty isolated.. anyone want to volunteer take a shot at improving this part of dmd?

Later,
Brad
May 07, 2010
Walter Bright wrote:
> I recompiled dmd with the profiler (-gt switch) which confirmed it.

For those interested, try out changeset 470.
May 07, 2010
On 7-5-2010 9:10, Brad Roberts wrote:
> On Fri, 7 May 2010, Lionello Lunesu wrote:
> 
>> On 6-5-2010 22:37, Michel Fortin wrote:
>>> On 2010-05-05 23:45:50 -0400, Walter Bright <newshound1@digitalmars.com> said:
>>>
>>>> Walter Bright wrote:
>>>>> Alex Makhotin wrote:
>>>>>> It takes ~40 seconds 50% load on the dual core processor(CentOS 5.3 kernel 2.6.32.4), to get the actual error messages about the undefined identifier.
>>>>>
>>>>> Definitely there's a problem.
>>>>
>>>> The problem is the spell checker is O(n*n) on the number of characters
>>>> in the undefined identifier.
>>>
>>> That's an algorithm that can't scale then.
>>>
>>> Checking the Levenshtein distance for each known identifier within a small difference in length would be a better idea. (Clang is said to use the Levenshtein distance, it probably does something of the sort.)
>>>
>>> http://en.wikipedia.org/wiki/Levenshtein_distance
>>>
>> and especially this line:
>>
>> # If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.
> 
> The source for this is pretty isolated.. anyone want to volunteer take a shot at improving this part of dmd?
> 
> Later,
> Brad
I see, speller.c, I'll have a look..
May 07, 2010
Walter Bright wrote:
> Walter Bright wrote:
>> I recompiled dmd with the profiler (-gt switch) which confirmed it.
> 
> For those interested, try out changeset 470.

On my timing tests, the time spent is linear with the number of characters in the identifier. It's still too slow, though.
May 09, 2010
On 7-5-2010 12:01, Lionello Lunesu wrote:
> On 7-5-2010 9:10, Brad Roberts wrote:
>> On Fri, 7 May 2010, Lionello Lunesu wrote:
>>
>>> On 6-5-2010 22:37, Michel Fortin wrote:
>>>> On 2010-05-05 23:45:50 -0400, Walter Bright <newshound1@digitalmars.com> said:
>>>>
>>>>> Walter Bright wrote:
>>>>>> Alex Makhotin wrote:
>>>>>>> It takes ~40 seconds 50% load on the dual core processor(CentOS 5.3 kernel 2.6.32.4), to get the actual error messages about the undefined identifier.
>>>>>>
>>>>>> Definitely there's a problem.
>>>>>
>>>>> The problem is the spell checker is O(n*n) on the number of characters
>>>>> in the undefined identifier.
>>>>
>>>> That's an algorithm that can't scale then.
>>>>
>>>> Checking the Levenshtein distance for each known identifier within a small difference in length would be a better idea. (Clang is said to use the Levenshtein distance, it probably does something of the sort.)
>>>>
>>>> http://en.wikipedia.org/wiki/Levenshtein_distance
>>>>
>>> and especially this line:
>>>
>>> # If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.
>>
>> The source for this is pretty isolated.. anyone want to volunteer take a shot at improving this part of dmd?
>>
>> Later,
>> Brad
> I see, speller.c, I'll have a look..

I'm in the middle of moving from one city to another so don't wait for me. I have attached the D version of the code in the wikipedia article (including the patch for transpositions.)

It's not straightforward to drop this in (apart from it being in D), since speller.c creates all variations on a string (=inefficient) and uses a callback function to check if a variation is a valid symbol.

I'll have more time to look at this next week.

L.


May 09, 2010
That code is in the public domain, by the way.

DMD should require a copyright notice in each source file :)

L.
May 10, 2010
On Sun, 09 May 2010 02:11:21 -0400, Lionello Lunesu <lio@lunesu.remove.com> wrote:


> I'm in the middle of moving from one city to another so don't wait for
> me. I have attached the D version of the code in the wikipedia article
> (including the patch for transpositions.)
>
> It's not straightforward to drop this in (apart from it being in D),
> since speller.c creates all variations on a string (=inefficient) and
> uses a callback function to check if a variation is a valid symbol.
>
> I'll have more time to look at this next week.

Several others have privately brought up this problem to Walter.  He does not want to change how the symbol lookup tables work, and there is no way to iterate them.  Therefore, without a way to iterate current symbols, this is the only way the algo can be written.

However, according to reports on the latest beta, he's sped up the lookup times for symbols significantly enough to perceptually reduce the problem.

Because of the nature of the algorithm, the longer the invalid symbol, the slower the algorithm.  It would be interesting to see a comparison between the current beta code and code that does a full iteration with very long symbols.  I don't know if anyone wants to look at modifying the symbol lookup data structures to allow iteration, it may be perceptually insignificant, and unimportant for most developers.

A quick test on a long symbol name:


void s023456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567891() {}
void main()
{
     s123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890();
}



And on the various compilers:

2.043: 0.052s, does not suggest
2.045: 5.4s, suggests the alternative
beta: 0.8s, does not suggest

2.043 only does spell checking with a Levenshtein distance of 1, 2.045 does 2, but is extremely slow.  The beta does a distance of 2, but only if the errors are close together (Walter added this as an optimization to remove one factor from the runtime complexity).

So clearly, there is still room for improvement, I think with a proper symbol iteration, we could get the time down to be as quick as 2.043 or faster *and* provide the ability to check for a complete LD of 2 where the errors are not close together.  We might be able to even push the LD to 3 or 4.

I've thought about the optimization of errors close together being checked, and I think the counter case is the case which takes the longest -- a long symbol.  Typically people use camelCase to denote symbols, what if two of the words in the symbol were misspelled by one character (or a capitalization was forgotten)?

It may not be an issue, the spell checker is simply a nice hint, but isn't essential to determine errors.

-Steve
May 10, 2010
Steven Schveighoffer wrote:
> On Sun, 09 May 2010 02:11:21 -0400, Lionello Lunesu <lio@lunesu.remove.com> wrote:
> 
> 
>> I'm in the middle of moving from one city to another so don't wait for
>> me. I have attached the D version of the code in the wikipedia article
>> (including the patch for transpositions.)
>>
>> It's not straightforward to drop this in (apart from it being in D),
>> since speller.c creates all variations on a string (=inefficient) and
>> uses a callback function to check if a variation is a valid symbol.
>>
>> I'll have more time to look at this next week.
> 
> Several others have privately brought up this problem to Walter.  He does not want to change how the symbol lookup tables work, and there is no way to iterate them.  Therefore, without a way to iterate current symbols, this is the only way the algo can be written.
> 
> However, according to reports on the latest beta, he's sped up the lookup times for symbols significantly enough to perceptually reduce the problem.
> 
> Because of the nature of the algorithm, the longer the invalid symbol, the slower the algorithm.  It would be interesting to see a comparison between the current beta code and code that does a full iteration with very long symbols.  I don't know if anyone wants to look at modifying the symbol lookup data structures to allow iteration, it may be perceptually insignificant, and unimportant for most developers.
> 
> A quick test on a long symbol name:
> 
> 
> void s023456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567891() {}
> void main()
> {
>      s123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890(); 
> 
> }
> 
> 
> 
> And on the various compilers:
> 
> 2.043: 0.052s, does not suggest
> 2.045: 5.4s, suggests the alternative
> beta: 0.8s, does not suggest
> 
> 2.043 only does spell checking with a Levenshtein distance of 1, 2.045 does 2, but is extremely slow.  The beta does a distance of 2, but only if the errors are close together (Walter added this as an optimization to remove one factor from the runtime complexity).
> 
> So clearly, there is still room for improvement, I think with a proper symbol iteration, we could get the time down to be as quick as 2.043 or faster *and* provide the ability to check for a complete LD of 2 where the errors are not close together.  We might be able to even push the LD to 3 or 4.
> 
> I've thought about the optimization of errors close together being checked, and I think the counter case is the case which takes the longest -- a long symbol.  Typically people use camelCase to denote symbols, what if two of the words in the symbol were misspelled by one character (or a capitalization was forgotten)?
> 
> It may not be an issue, the spell checker is simply a nice hint, but isn't essential to determine errors.
> 
> -Steve

I can't imagine why something as simple as iterating a symbol lookup table can't be implemented.

(in fact I iterate that lookup table in Descent to show autocompletions... I don't generate every possible string in the world and see if it's a symbol, and if so, suggest it as an autocompletion... hmmm...)
May 10, 2010
On Mon, 10 May 2010 09:22:21 -0400, Ary Borenszweig <ary@esperanto.org.ar> wrote:

> Steven Schveighoffer wrote:
>
>>  Several others have privately brought up this problem to Walter.  He does not want to change how the symbol lookup tables work, and there is no way to iterate them.
>
> I can't imagine why something as simple as iterating a symbol lookup table can't be implemented.
>
> (in fact I iterate that lookup table in Descent to show autocompletions... I don't generate every possible string in the world and see if it's a symbol, and if so, suggest it as an autocompletion... hmmm...)

On that point I'm just the messenger, I'm not sure why Walter has that position, and I don't know enough about the code to know how to fix it.

-Steve
May 10, 2010
Hello Steven,

> Several others have privately brought up this problem to Walter. He
> does not want to change how the symbol lookup tables work, and there
> is no way to iterate them.
> 

Is it fundamentally impossible to iterate or is the code just not there and/or nasty to write?

-- 
... <IXOYE><