maketrans and translate

Feb 13, 2012

bearophile

Feb 13, 2012

Stewart Gordon

Feb 13, 2012

Feb 13, 2012

Feb 13, 2012

Feb 13, 2012

In the online docs I've seen that std.string.maketrans() is (going to be) deprecated. How do you adapt this code to the new regime? import std.stdio, std.string; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ", "0111222333445566677788899901112223334455666777888999"); auto text2 = text.translate(tab, "\""); writeln(text2); } Bye, bearophile

On 13/02/2012 03:04, bearophile wrote: > In the online docs I've seen that std.string.maketrans() is (going to be) deprecated. > How do you adapt this code to the new regime? <snip> Use an associative array for the translation table. Or write your own functions that work in the same way as maketrans/translate. Stewart.

Stewart Gordon: > Use an associative array for the translation table. A new version of the code: import std.stdio, std.string, std.range; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; auto text3 = text.translate(aa, "\""); writeln(text3); } > Or write your own functions that work in the same way as maketrans/translate. The code with the associative array is longer, and every char in the original string requires one or two associative array lookups. This is quite inefficient for ASCII strings. Instead of writing my own string functions, or copying them from an older Phobos, what about the idea of keeping both the old maketrans and translate functions too and to un-deprecate them (maybe moving them in std.ascii)? They are not UTF-safe, but I have large amounts of biological data text that is not Unicode. Bye, bearophile

On Monday, February 13, 2012 08:09:06 bearophile wrote: > Stewart Gordon: > > Use an associative array for the translation table. > > A new version of the code: > > import std.stdio, std.string, std.range; > void main() { > char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII > text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', > 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', > 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', > 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', > 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', > 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', > 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; > auto text3 = text.translate(aa, "\""); > writeln(text3); > } > > > Or write your own functions that work in the same way as maketrans/translate. > The code with the associative array is longer, and every char in the original string requires one or two associative array lookups. This is quite inefficient for ASCII strings. > > Instead of writing my own string functions, or copying them from an older Phobos, what about the idea of keeping both the old maketrans and translate functions too and to un-deprecate them (maybe moving them in std.ascii)? They are not UTF-safe, but I have large amounts of biological data text that is not Unicode. Do you have data to backup that there is a significant speed difference? We're trying to not have any ASCII-only string stuff. - Jonathan M Davis

February 13, 2012

Re: maketrans and translate

Posted by bearophile
in reply to Jonathan M Davis

Permalink

bearophile

Posted in reply to Jonathan M Davis

Permalink

Jonathan M Davis:

> Do you have data to backup that there is a significant speed difference?

I have just written a small benchmark, it contains both the URL to the test data and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me.

version3 uses the old translate. version4 uses the new translate. version1InPlace is a reference point, it works in-place. version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache).

If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string).


import std.stdio, std.string, std.file, std.exception;

char[] version1InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    bool[256] deltab = false;
    foreach (char c; delchars)
        deltab[c] = true;

    int count;
    foreach (char c; text)
        if (!deltab[c]) {
            text[count] = transtab[c];
            count++;
        }

    return text[0 .. count];
}


char[] version2InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    int count;
    outer: foreach (char c; text) {
        foreach (char d; delchars)
            if (c == d)
                continue outer;
        text[count] = transtab[c];
        count++;
    }

    return text[0 .. count];
}


string version3(char[] text) {
    auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                         "0111222333445566677788899901112223334455666777888999");
    return text.translate(tab, "\"");
}


string version4(string text) {
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    return text.translate(aa, "\"");
}


void main() {
    // http://patriot.net/~bmcgin/kjv12.txt
    auto txt = cast(char[])read("kjv12.txt");

    assert(version1InPlace(txt.dup) == version2InPlace(txt.dup));
    assert(version1InPlace(txt.dup) == version3(txt.dup));
    assert(version1InPlace(txt.dup) == version4(txt.idup));

    static if (1) version1InPlace(txt); // 0.16 seconds
    static if (0) version2InPlace(txt); // 0.18 seconds
    static if (0) version3(txt); // 0.20 seconds
    static if (0) version4(assumeUnique(txt)); // 0.97 seconds
}


> We're trying to not have any ASCII-only string stuff.

Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good.

Bye,
bearophile

On Monday, February 13, 2012 14:08:16 bearophile wrote: > Jonathan M Davis: > > Do you have data to backup that there is a significant speed difference? > > I have just written a small benchmark, it contains both the URL to the test data and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me. > > version3 uses the old translate. version4 uses the new translate. version1InPlace is a reference point, it works in-place. version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache). > > If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string). I'll look into the best way to handle it. Those functions aren't being deprecated this release regardless. > > We're trying to not have any ASCII-only string stuff. > > Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good. In most cases, it should make no difference. You could use 1, 2, 3, and 4 as easily as you could use A, T, C, and G. Most algorithms don't care one whit whether you're dealing with characters or numbers. The only problem that I see is cases where there's an algorithm which would normally only be done with characters and not numbers and which therefore is in std.string and not std.array or std.algorithm - and translate falls into that category. Regardless, in general, Phobos is not going to have ASCII-specific string processing. - Jonathan M Davis

Forums