August 03, 2005
On Wed, 3 Aug 2005 19:37:21 +0000 (UTC), jicman wrote:


[snip]
> I am
> not just searching for one letter.

I have a much faster string finding routine coded up in Build. It uses the Boyer-Moore algorithm. Here it is ...

<code>
template BMScan(T) {
int BMScan(T[] pContainer, T[] pFind, uint pStartPos=0)
{
    int[T] lShift; // Shift table.
    int lCP;    // Index into container array
    int lCE;    // Current high-water mark in containter array
    int lFP;    // Index into find array

    // Do length checks first.
    if (pFind.length == 0)
        return -1;
    if (pStartPos > pContainer.length)
        return -1;
    if (pStartPos < 0)
        return -1;
    if (pStartPos > 0)
        pContainer = pContainer[pStartPos .. $];
    if (pContainer.length == 0)
        return -1;
    if (pFind.length > pContainer.length)
        return -1;

    // If find string is only one element long, do a brute force scan.
    if (pFind.length == 1)
    {
        T pElem = pFind[0];
        foreach(int i, T lElem; pContainer[pStartPos .. $])
        {
            if ( lElem == pElem)
                return i + pStartPos;
        }
        return -1;
    }

    // Prepare shift table.
    foreach(int i, T c; pFind)
    {
        lShift[c] = pFind.length - i - 1;
    }

    // Start scan at rightmost position.
    lCP = pFind.length - 1;
    while(lCP < pContainer.length)
    {
        lFP = pFind.length - 1;
        assert(lCP == 0 || lCP > lCE);
        lCE = lCP;  // Remember where I'm up to so far.

        if (pContainer[lCP] != pFind[lFP])
        {   // Found a mismatch, so realign the arrays.
            if ((pContainer[lCP] in lShift) == null)
            {
                // This element is not anywhere in the Find array
                // so we can skip over it to align up just beyond it.
                lCP += pFind.length;
            }
            else
            {
                // This element is somewhere in the Find array
                // so if we haven't checked all Find elements we
                // can slide up to align with the rightmost
                // position where the element occurs in the Find
                // array. However, if we have checked the entire
                // Find array, then it isn't in the Container array.
                if (lFP == 0)
                    return -1;
                lCP += lShift[pContainer[lCP]];
            }
        }
        else
        {   // Found a match, so scan right-to-left...
            // unless this we are at the first Find array element
            // in which case we have found it.
            if (lFP == 0)
               return lCP + pStartPos;

           // Keep scanning until we run out of elements in the Find array.
            while(lFP > 0)
            {
                // Point to next pair to check.
                lFP--;
                lCP--;

                if (pContainer[lCP] != pFind[lFP])
                {   // Found a mismatch, so realign the arrays.
                    if ((pContainer[lCP] in lShift) == null)
                    {
                      // This element is not anywhere in the Find array
                      // so we can skip over it to align up just beyond it.
                        lCP += pFind.length;
                    }
                    else
                    {
                        // This element is somewhere in the Find array
                        // so if we haven't checked all Find elements we
                        // can realign the match so far beyond the current
                        // 'block'.
                        // However, if we have checked the entire
                       // Find array, then it isn't in the Container array.
                        if (lFP == 0)
                            return -1;
                        lCP = lCE + (pFind.length - lFP - 1);
                    }
                    // Stop element-by-element scanning and resume
                    // block-end checking.
                    break;
                }

                if (lFP == 0)
                    // We are at the begining of the Find array so
                    // that means we found a complete match!
                    return lCP + pStartPos;
            }
        }

    }
    // If we get here, then there is no match.
    return -1;
  }
}

</code>

And you use it like the phobos find() function ...

<code>
  if ( BMScan!(char)(line,"Loading ApplicationJarEntities") > -1) ...
</code>

-- 
Derek Parnell
Melbourne, Australia
4/08/2005 7:05:14 AM
August 03, 2005
Manfred Nowak <svv1999@hotmail.com> wrote:

> you have to implement a lexer.

For this task you can use re2c from r2c.org.

A partially filled template for your taks would look like this:

<code>
alias char YYCTYPE;
YYCTYPE* YYCURSOR, YYLIMIT, YYMARKER;
char[] choose( char[] line){
  void yyLimitAdjust(){
    YYLIMIT= &line[ line.length-1];
    YYLIMIT++;
  }
  void YYFILL( int n){
    uint now= line.length;
    line.length= now + n;
    yyLimitAdjust();
    for( uint i= now; i< line.length; i++)
      line[ i]= ' ';
  }
  YYCURSOR= &line[0];
  yyLimitAdjust();
  try{
    while( 1)
/*!re2c
"JUNKDATA"	{ return "JUNKDATA".dup;}
"Auto stopping job"	{ return "AutoStopJob".dup;}
"Waiting for ".*"job""s"?" to complete"	{ return
"JobsBackedUp".dup; }
'\n'	{}
.	{}
*/
  }
  catch{ return "NotUsableData";};
}
unittest{
  assert( choose( "JUNKDAT") == "NotUsableData");
  assert( choose( "JUNKDATA") == "JUNKDATA");
  assert( choose( "Waiting for one job to complete") ==
"JobsBackedUp");
  assert( choose( "Waiting for 20 jobs to complete") ==
"JobsBackedUp");
}
void main(){
}
</char>

This run through re2c can be fed into dmd nearly as is.

-manfred
August 03, 2005
"Derek Parnell" <derek@psych.ward> wrote in message news:9lcyd77i1y0t.uvlaprrcooxk.dlg@40tude.net...
> On Wed, 3 Aug 2005 19:37:21 +0000 (UTC), jicman wrote:
>
>
> [snip]
>> I am
>> not just searching for one letter.
>
> I have a much faster string finding routine coded up in Build. It uses the Boyer-Moore algorithm. Here it is ...

Hmm. when I try it on strings pContainer of length between 50 and 150 with a pFind of "this is a test of phobos" I get std.string.find outperforms the posted code by an order of magnitude and it seems to get worse as the GC load increses (I notice the lShift is an assoc array which will allocate nodes for every unique key on every call). What sorts of problem sizes are appropriate for BMScan?



August 04, 2005
> "JUNKDATA" { return "JUNKDATA".dup;}
> "Auto stopping job" { return "AutoStopJob".dup;}
> "Waiting for ".*"job""s"?" to complete" { return
> "JobsBackedUp".dup; }

any particular reason for the .dups?


August 04, 2005
>    int[T] lShift; // Shift table.
>    int lCP;    // Index into container array
>    int lCE;    // Current high-water mark in containter array
>    int lFP;    // Index into find array

I admit this is a truly nit-picky complaint but the "l" prefix above took me a while to figure out - I thought it was an I for "index". I only really confirmed those were l's when I changed my font.


August 04, 2005
"Ben Hinkle" <ben.hinkle@gmail.com> wrote:

[...]
> any particular reason for the .dups?

references to local data aren't allowed, are they?

-manfred
August 04, 2005
"Manfred Nowak" <svv1999@hotmail.com> wrote in message news:dcrovi$23ik$1@digitaldaemon.com...
> "Ben Hinkle" <ben.hinkle@gmail.com> wrote:
>
> [...]
>> any particular reason for the .dups?
>
> references to local data aren't allowed, are they?
>
> -manfred

String literals are stored in the shared data segment so they can be safely returned from functions (like in C). The local info on the stack is the ptr to the data segment and length to the string data and that is being copied to the return value. If the function returned the address of a local variable or a string stored locally on the stack then that would indeed be illegal.


August 04, 2005
On Wed, 3 Aug 2005 19:58:08 -0400, Ben Hinkle wrote:

> "Derek Parnell" <derek@psych.ward> wrote in message news:9lcyd77i1y0t.uvlaprrcooxk.dlg@40tude.net...
>> On Wed, 3 Aug 2005 19:37:21 +0000 (UTC), jicman wrote:
>>
>>
>> [snip]
>>> I am
>>> not just searching for one letter.
>>
>> I have a much faster string finding routine coded up in Build. It uses the Boyer-Moore algorithm. Here it is ...
> 
> Hmm. when I try it on strings pContainer of length between 50 and 150 with a pFind of "this is a test of phobos" I get std.string.find outperforms the posted code by an order of magnitude and it seems to get worse as the GC load increses (I notice the lShift is an assoc array which will allocate nodes for every unique key on every call). What sorts of problem sizes are appropriate for BMScan?

Ouch!

I've just tried some time trials myself and I have to say that phobos is faster in all cases, even when I have to do conversions of wchar[] and dchar[]. I also tried to remove the AA dependancies but the best I could do was to come in a half phobos's speed.

Conclusion: Don't use the BMScan code I supplied. There is something not quite right about.

On the lower-case 'L' prefix issue, I use a modified Courier font that displays this character's glyph as a lower-case "T" but without the bar. The net effect is a character that has a curved base and is easily distinguishable from both 'i' and 1. The same modified font also displays a zero with dot in the center of the oval to avoid confusion with 'O' and 'o'.

Sorry about that ... ;-)
-- 
Derek
Melbourne, Australia
4/08/2005 1:21:06 PM
August 04, 2005
"Ben Hinkle" <ben.hinkle@gmail.com> wrote:

> String literals are stored in the shared data segment so they can be safely returned from functions (like in C).

I cannot find any hint on this in the specs. So it is current but undocumented and therefore unpromised behaviour.

Furthermore and as far as I understand, the 64-bit OS's have given up the idea of segments. If this cannot be reestablished, the move to 64-bit will break masses of D-code?

-manfred
August 04, 2005
"Manfred Nowak" <svv1999@hotmail.com> wrote in message news:dcs6lm$2eod$1@digitaldaemon.com...
> "Ben Hinkle" <ben.hinkle@gmail.com> wrote:
>
>> String literals are stored in the shared data segment so they can be safely returned from functions (like in C).
>
> I cannot find any hint on this in the specs. So it is current but undocumented and therefore unpromised behaviour.

Since C behaves that way Walter probably assumed people would assume D behaves that way, too. It would be nice to have it mentioned in the doc, though.

> Furthermore and as far as I understand, the 64-bit OS's have given up the idea of segments. If this cannot be reestablished, the move to 64-bit will break masses of D-code?

That would make existing C code break, too, so I doubt it will be a problem.

> -manfred