Jump to page: 1 2
Thread overview
[Issue 7515] New: The new std.string.translate is slow for ASCII text
Mar 06, 2012
Jonathan M Davis
Mar 06, 2012
Jonathan M Davis
Mar 07, 2012
Jonathan M Davis
Mar 11, 2012
Jonathan M Davis
Mar 11, 2012
Jonathan M Davis
Mar 11, 2012
Jonathan M Davis
May 28, 2012
Jonathan M Davis
Jun 11, 2012
Jonathan M Davis
February 16, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515

           Summary: The new std.string.translate is slow for ASCII text
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: bearophile_hugs@eml.cc


--- Comment #0 from bearophile_hugs@eml.cc 2012-02-15 19:23:51 PST ---
std.string.maketrans() is going to be deprecated. This is a demo program, it's meant to work on a good amount of pure 7-bit ASCII text.

The conversion table is the same as the one in this page: http://digitalmars.com/d/1.0/lisp-java-d.html


import std.stdio, std.string;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    auto tab =
maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",

"0111222333445566677788899901112223334455666777888999");
    auto text2 = text.translate(tab, "\"");
    writeln(text2);
}


Using the new std.string.maketrans the code becomes something like:


import std.stdio, std.string, std.range;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    auto text3 = text.translate(aa, "\"");
    writeln(text3);
}


The code with the associative array is _uglier_, _longer_ to write and read, and every char in the original string requires one or two associative array lookups. This is quite inefficient for sizeable ASCII strings.



Jonathan M Davis has asked me a benchmark:

> Do you have data to backup that there is a significant speed difference?

So I have written the following small benchmark, it contains both the URL to the test data (bible) and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me.

Notes:
- version3 uses the old translate;
- version4 uses the new translate;
- version1InPlace is a reference point, it works in-place;
- version2InPlace is similar, uses a linear scan for the chars to remove, and
it seems a bit slower than version1InPlace (probably because the small deltab
array fits well in the L1 cache).

If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string).


import std.stdio, std.string, std.file, std.exception;

char[] version1InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    bool[256] deltab = false;
    foreach (char c; delchars)
        deltab[c] = true;

    int count;
    foreach (char c; text)
        if (!deltab[c]) {
            text[count] = transtab[c];
            count++;
        }

    return text[0 .. count];
}


char[] version2InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    int count;
    outer: foreach (char c; text) {
        foreach (char d; delchars)
            if (c == d)
                continue outer;
        text[count] = transtab[c];
        count++;
    }

    return text[0 .. count];
}


string version3(char[] text) {
    auto tab =
maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",

"0111222333445566677788899901112223334455666777888999");
    return text.translate(tab, "\"");
}


string version4(string text) {
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    return text.translate(aa, "\"");
}


void main() {
    // http://patriot.net/~bmcgin/kjv12.txt
    auto txt = cast(char[])read("kjv12.txt");

    assert(version1InPlace(txt.dup) == version2InPlace(txt.dup));
    assert(version1InPlace(txt.dup) == version3(txt.dup));
    assert(version1InPlace(txt.dup) == version4(txt.idup));

    static if (1) version1InPlace(txt); // 0.16 seconds
    static if (0) version2InPlace(txt); // 0.18 seconds
    static if (0) version3(txt); // 0.20 seconds
    static if (0) version4(assumeUnique(txt)); // 0.97 seconds
}



Jonathan M Davis:
> We're trying to not have any ASCII-only string stuff.

Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good.


A possibility is to keep both the old maketrans and translate functions, and un-deprecate them (maybe moving them in std.ascii?). Other ideas are welcome.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 06, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515


Jonathan M Davis <jmdavisProg@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg@gmx.com


--- Comment #1 from Jonathan M Davis <jmdavisProg@gmx.com> 2012-03-06 00:15:03 PST ---
https://github.com/D-Programming-Language/phobos/pull/478

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 06, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #2 from bearophile_hugs@eml.cc 2012-03-06 04:13:54 PST ---
(In reply to comment #1)
> https://github.com/D-Programming-Language/phobos/pull/478

Thank you. Your patch seems nice.

Two suggestions:

1) If they don't already say it, then I suggest to add in the ddoc of the functions that the translation arrays have a length 128. This for both old D programmers and Python programmers.

2) This code in translate():

+    bool[128] remTable;
+
+    remTable[] = false;


I suggest to write it like this, that's shorter and avoids a double initialization:

+    bool[128] remTable = false;

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 06, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #3 from Jonathan M Davis <jmdavisProg@gmx.com> 2012-03-06 15:27:13 PST ---
> 1) If they don't already say it, then I suggest to add in the ddoc of the functions that the translation arrays have a length 128. This for both old D programmers and Python programmers.

I don't know. It seems to me that that's an implementation detail. makeTrans doesn't even return a static array. It returns a dynamic one. So, as long as the array passed to translate comes from makeTrans (as it should), it's complete non-issue. If anything, I'd argue for creating a struct (e.g. TransTable) which just held the dynamic array without giving access to it so that no one _can_ screw with it (since screwing with it is just going to break code).

It seems to me that the only reason that makeTrans exists in the first place rather than just putting it in translate is so that you can reuse the translation table. No one should be messing with it or passing anything else to that version of translate. So, why does the length of the array matter to the user of makeTrans or translate?

And actually, I'm _really_ tempted to just go and change it so that it returns a struct which you can't mess with and is only created by makeTrans. That completely resolves any possible issues caused by someone passing the wrong array to translate.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 06, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #4 from bearophile_hugs@eml.cc 2012-03-06 15:42:22 PST ---
(In reply to comment #3)

> It seems to me that the only reason that makeTrans exists in the first place rather than just putting it in translate is so that you can reuse the translation table. No one should be messing with it or passing anything else to that version of translate.

As several other string functions, the two functions we are talking about here come from Python2: http://docs.python.org/library/string.html#string-functions http://docs.python.org/library/string.html#string.translate


The Python docs tell the size of the table too:

string.translate(s, table[, deletechars])
    Delete all characters from s that are in deletechars (if present), and then
translate the characters using table, which must be a 256-character string
giving the translation for each character value, indexed by its ordinal. If
table is None, then only the character deletion step is performed.


In Python some times I have built the translation table manually. If makeTrans returns an opaque or privately defined struct this is not possible.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 07, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #5 from Jonathan M Davis <jmdavisProg@gmx.com> 2012-03-06 16:06:35 PST ---
And _why_ would you ever need to build it manually? That's what makeTrans is for.

The docs would have to explain exactly how the array is laid out for it to be at all reasonable for the user to be screwing with it, and right now, they definitely treat it as completely opaque. And without a really good reason why the user would be screwing with the translation table on their own, I would definitely argue that having it be opaque is a huge benefit, since it stops a whole possible set of bugs caused by translate being passed an array which doesn't conform to what it requires.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 11, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #6 from bearophile_hugs@eml.cc 2012-03-11 13:53:26 PDT ---
I have thought some more about this and I have changed my mind: there is no point to artificially restrict the usefulness of this function. So I suggest:

1) To not use a opaque struct, but leave the simple translation array.
2) To use a translation array of 256 items. So translate is usable for an
ubyte[] array input too (with a cast of the ubyte[] to char[] in the function
call if necessary).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 11, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #7 from Jonathan M Davis <jmdavisProg@gmx.com> 2012-03-11 14:22:41 PDT ---
If you want a truly generic mapping function which operates on ubyte and creates an array which holds all of the possible values of that type, then it makes no sense to be putting it in std.string. It should be more generic and be in std.array (though given that it wouldn't really make sense to do what makeTrans and translate are doing with anything larger than a ubyte, I question how good an idea that would lbe).

If what you care about is strings, then there is no point in having an array greater than 128 elements long, because there aren't any ASCII characters greater than that. Having 256 is pointless and wasteful, because there won't be any characters that large. Not to mention, the _whole_ reason that you were complaining about this in the first place was because the unicode-aware version of translate was slower than using maketrans/trnslate when dealing with pure ASCII, and using 128 elements is going to be faster.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 11, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #8 from bearophile_hugs@eml.cc 2012-03-11 14:40:45 PDT ---
(In reply to comment #7)

Thank you for your answers.

> If you want a truly generic mapping function which operates on ubyte and creates an array which holds all of the possible values of that type, then it makes no sense to be putting it in std.string. It should be more generic and be in std.array (though given that it wouldn't really make sense to do what makeTrans and translate are doing with anything larger than a ubyte, I question how good an idea that would lbe).

I have not asked for a truly generic function for std.array. So this problem doesn't happen.


> If what you care about is strings, then there is no point in having an array greater than 128 elements long, because there aren't any ASCII characters greater than that. Having 256 is pointless and wasteful, because there won't be any characters that large. Not to mention, the _whole_ reason that you were complaining about this in the first place was because the unicode-aware version of translate was slower than using maketrans/trnslate when dealing with pure ASCII, and using 128 elements is going to be faster.

Now I have had to use translate() for a different job: I have a good amount of
8-bit ASCII text in my language to process
(http://en.wikipedia.org/wiki/Extended_ASCII ). It's not Unicode, and I don't
think I can convert it to Unicode. I have to use translate() on this text. A
128-elements translate() isn't enough.

The "waste" caused by using a 256 items array instead of a 128 items array is probably not significant.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
March 11, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515



--- Comment #9 from Jonathan M Davis <jmdavisProg@gmx.com> 2012-03-11 15:12:49 PDT ---
Okay. It seems to me that this is getting a bit ridiculous. Phobos doesn't support Extendend ASCII _anywhere_. Its support of ASCII-specific stuff is already minimal. So, to set up a string function to take a string which is _completely_ invalid as far as the rest of Phobos is concerned seems really off to me. All you have to do with an ASCII-only version is make sure that you don't pass it unicode characters. But with an Extended ASCII version, all of a sudden you have invalid unicode sequences. The only support that Phobos gives for non-unicode and non-ASCII encodings is std.encoding, and almost all of the ASCII-specific stuff is in std.ascii.

The _only_ reason that I see to support this is the fact that the implementation that we've had happens to.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
« First   ‹ Prev
1 2