Jump to page: 1 2
Thread overview
String compare performance
Nov 27, 2010
bearophile
Nov 27, 2010
Ellery Newcomer
Nov 27, 2010
bearophile
Nov 27, 2010
spir
Nov 27, 2010
bearophile
Nov 27, 2010
Jonathan M Davis
Nov 28, 2010
Austin Hastings
Nov 28, 2010
Jonathan M Davis
Nov 28, 2010
Iain Buclaw
Nov 28, 2010
bearophile
Nov 28, 2010
Jonathan M Davis
Nov 28, 2010
bearophile
Nov 28, 2010
spir
Nov 28, 2010
bearophile
Nov 29, 2010
Michel Fortin
Dec 08, 2010
Bruno Medeiros
Dec 08, 2010
Brüno Mediocre
Dec 09, 2010
Bruno Medeiros
Dec 09, 2010
Daniel Gibson
November 27, 2010
While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark.

Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-)

The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3), but the situation doesn't change much).

Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  Psy:  1.59
  D #2: 0.55
  D #3: 0.34


2.3 GHz CPU.
dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.4.5.

--------------------------

# Python2 version
import psyco

def test(data):
    count = 0
    for i in xrange(len(data) - 3):
        codon = data[i : i + 3]
        if codon == "TAG" or codon == "TGA" or codon == "TAA":
            count += 1
    return count

def main():
    data = open("data.txt").read()
    data = data * 300
    print test(data)

psyco.full()
main()

--------------------------

To produce some test data:


# Python2 code
from itertools import islice

def rnd(ia = 3877, ic = 29573, im = 139968):
    seed = 42
    imf = float(im)
    while 1:
        seed = (seed * ia + ic) % im
        r = seed / imf
        yield "ACGT"[int(r * 4)]


def main():
    data = "".join(islice(rnd(), 0, 120000))
    file("data.txt", "w").write(data)

main()

--------------------------

A D2 traslation (I have used printf to reduce a lot the asm produced):


// D2 version #1
import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (codon == "TAG" || codon == "TGA" || codon == "TAA")
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------

// D2 version #2
import std.file: read;
import std.c.stdio: printf;

bool cmp3(char[] s1, string s2) {
    return s1[0] == s2[0] && s1[1] == s2[1] && s1[2] == s2[2];
}

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------

// D2 version #3
import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;

    foreach (i; 0 ..  data.length - 3) {
        if (data[i] == 'T') {
            if (data[i+1] == 'A') {
                if (data[i+2] == 'G') {
                    count++;
                } else if (data[i+2] == 'A') {
                    count++;
                }
            } else if (data[i+1] == 'G') {
                if (data[i+2] == 'A')
                    count++;
            }
        }
    }

    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------

dmd -O -release -inline test1.d

_D5test14testFAaZi	comdat
L0:		sub	ESP,024h
		push	EBX
		xor	EBX,EBX
		push	EBP
		mov	EBP,030h[ESP]
		push	ESI
		xor	ESI,ESI
		add	EBP,0FFFFFFFDh
		push	EDI
		je	LA8
		mov	EDX,03Ch[ESP]
		mov	EAX,038h[ESP]
		mov	EDI,EDX
L22:		mov	ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
		lea	EAX,3[EBX]
		sub	EAX,EBX
		push	ECX
		lea	EDX,[EDI][EBX]
		push	dword ptr FLAT:_DATA[0Ch]
		push	dword ptr FLAT:_DATA[08h]
		mov	020h[ESP],EAX
		mov	024h[ESP],EDX
		push	EDX
		push	EAX
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		jne	L9E
		mov	ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
		push	ECX
		push	dword ptr FLAT:_DATA[01Ch]
		push	dword ptr FLAT:_DATA[018h]
		push	dword ptr 024h[ESP]
		push	dword ptr 024h[ESP]
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		jne	L9E
		mov	EDX,offset FLAT:_D11TypeInfo_Aa6__initZ
		push	EDX
		push	dword ptr FLAT:_DATA[02Ch]
		push	dword ptr FLAT:_DATA[028h]
		push	dword ptr 024h[ESP]
		push	dword ptr 024h[ESP]
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		je	L9F
L9E:		inc	ESI
L9F:		inc	EBX
		cmp	EBX,EBP
		jb	L22
LA8:		pop	EDI
		mov	EAX,ESI
		pop	ESI
		pop	EBP
		pop	EBX
		add	ESP,024h
		ret	8

--------------------------

dmd -O -release -inline test2.d

_D5test24testFAaZi	comdat
		sub	ESP,03Ch
		mov	ECX,040h[ESP]
		push	EBX
		push	EBP
		push	ESI
		xor	ESI,ESI
		push	EDI
		xor	EDI,EDI
		add	ECX,0FFFFFFFDh
		mov	01Ch[ESP],ECX
		je	LC5
		mov	EDX,054h[ESP]
		mov	EAX,050h[ESP]
		mov	EBP,EDX
L26:		lea	EBX,3[ESI]
		sub	EBX,ESI
		lea	ECX,[EBP][ESI]
		mov	028h[ESP],ECX
		mov	DL,[ECX]
		mov	ECX,FLAT:_DATA[0Ch]
		mov	024h[ESP],EBX
		mov	EAX,FLAT:_DATA[08h]
		cmp	DL,[ECX]
		mov	010h[ESP],ECX
		jne	L65
		mov	EDX,028h[ESP]
		mov	EAX,EBX
		mov	EBX,010h[ESP]
		mov	CL,1[EDX]
		cmp	CL,1[EBX]
		jne	L65
		mov	DL,2[EDX]
		cmp	DL,2[EBX]
		je	LB9
L65:		mov	EDX,028h[ESP]
		mov	CL,[EDX]
		mov	018h[ESP],EDX
		mov	EAX,024h[ESP]
		mov	EDX,FLAT:_DATA[01Ch]
		cmp	CL,[EDX]
		mov	EAX,FLAT:_DATA[018h]
		jne	L96
		mov	EBX,018h[ESP]
		mov	AL,1[EBX]
		cmp	AL,1[EDX]
		jne	L96
		mov	AL,2[EBX]
		cmp	AL,2[EDX]
		je	LB9
L96:		mov	EDX,FLAT:_DATA[02Ch]
		mov	EAX,FLAT:_DATA[028h]
		cmp	CL,[EDX]
		jne	LBA
		mov	ECX,018h[ESP]
		mov	BL,1[ECX]
		cmp	BL,1[EDX]
		jne	LBA
		mov	CL,2[ECX]
		cmp	CL,2[EDX]
		jne	LBA
LB9:		inc	EDI
LBA:		inc	ESI
		cmp	ESI,01Ch[ESP]
		jb	L26
LC5:		mov	EAX,EDI
		pop	EDI
		pop	ESI
		pop	EBP
		pop	EBX
		add	ESP,03Ch
		ret	8

--------------------------

dmd -O -release -inline test3.d

_D5test34testFAaZi	comdat
		push	EAX
		xor	EDX,EDX
		push	EBX
		xor	EBX,EBX
		push	ESI
		mov	ESI,010h[ESP]
		add	ESI,0FFFFFFFDh
		je	L4A
		mov	8[ESP],EDX
		mov	EDX,014h[ESP]
		mov	ECX,EDX
		mov	EAX,010h[ESP]
		mov	EDX,8[ESP]
L22:		cmp	[ECX][EDX],054h
		jne	L45
		mov	AL,1[ECX][EDX]
		cmp	AL,041h
		jne	L39
		cmp	byte ptr 2[ECX][EDX],047h
		je	L44
		jmp short	L3D
L39:		cmp	AL,047h
		jne	L45
L3D:		cmp	byte ptr 2[ECX][EDX],041h
		jne	L45
L44:		inc	EBX
L45:		inc	EDX
		cmp	EDX,ESI
		jb	L22
L4A:		pop	ESI
		mov	EAX,EBX
		pop	EBX
		pop	ECX
		ret	8

--------------------------

gdc -O3 -frelease -inline -S test1.d -o test1.s

_D5test14testFAaZi:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	8(%ebp), %eax
	movl	12(%ebp), %edx
	movl	$0, -24(%ebp)
	subl	$3, %eax
	movl	%edx, -20(%ebp)
	movl	%eax, -16(%ebp)
	je	.L5
	xorl	%edx, %edx
	jmp	.L8
	.p2align 4,,7
	.p2align 3
.L6:
	addl	$1, -24(%ebp)
	addl	$1, %edx
	cmpl	%edx, -16(%ebp)
	jbe	.L5
.L8:
	movl	-20(%ebp), %eax
	movl	$.LC0, %edi
	movl	$3, %ecx
	addl	%edx, %eax
	movl	%eax, %esi
	repz cmpsb
	je	.L6
	movl	%eax, %esi
	movl	$.LC1, %edi
	movl	$3, %ecx
	repz cmpsb
	je	.L6
	movl	$.LC2, %edi
	movl	%eax, %esi
	movl	$3, %ecx
	repz cmpsb
	je	.L6
	addl	$1, %edx
	cmpl	%edx, -16(%ebp)
	ja	.L8
.L5:
	movl	-24(%ebp), %eax
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret

--------------------------------------

gdc -O3 -frelease -inline -S test2.d -o test2.s

_D5test24testFAaZi:
	pushl	%ebp
	xorl	%edx, %edx
	movl	%esp, %ebp
	xorl	%eax, %eax
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$16, %esp
	movl	8(%ebp), %ebx
	movl	12(%ebp), %esi
	movl	%ebx, %edi
	subl	$3, %edi
	jne	.L18
	jmp	.L9
	.p2align 4,,7
	.p2align 3
.L10:
	movl	-16(%ebp), %ecx
	movzbl	.LC1, %ebx
	cmpb	(%ecx), %bl
	je	.L15
.L12:
	movzbl	.LC2, %ebx
	cmpb	(%ecx), %bl
	je	.L21
.L13:
	addl	$1, %edx
	cmpl	%edx, %edi
	jbe	.L9
	.p2align 4,,7
	.p2align 3
.L18:
	leal	(%esi,%edx), %ecx
	movzbl	.LC0, %ebx
	movl	$3, -20(%ebp)
	movl	%ecx, -16(%ebp)
	cmpb	(%ecx), %bl
	jne	.L10
	movzbl	.LC0+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L10
	movzbl	.LC0+2, %ebx
	cmpb	2(%ecx), %bl
	jne	.L10
.L11:
	addl	$1, %edx
	addl	$1, %eax
	cmpl	%edx, %edi
	ja	.L18
.L9:
	addl	$16, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
	.p2align 4,,7
	.p2align 3
.L15:
	movzbl	.LC1+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L12
	movzbl	.LC1+2, %ebx
	cmpb	2(%ecx), %bl
	je	.L11
	movzbl	.LC2, %ebx
	cmpb	(%ecx), %bl
	jne	.L13
	.p2align 4,,7
	.p2align 3
.L21:
	movzbl	.LC2+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L13
	movzbl	.LC2+2, %ebx
	cmpb	2(%ecx), %bl
	jne	.L13
	jmp	.L11

--------------------------------------

gdc -O3 -frelease -inline -S test3.d -o test3.s

_D5test34testFAaZi:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	pushl	%esi
	movl	8(%ebp), %esi
	pushl	%ebx
	movl	12(%ebp), %ebx
	subl	$3, %esi
	je	.L6
	movl	$1, %edx
	jmp	.L7
	.p2align 4,,7
	.p2align 3
.L3:
	cmpl	%edx, %esi
	leal	1(%edx), %ecx
	jbe	.L6
.L11:
	movl	%ecx, %edx
.L7:
	cmpb	$84, -1(%ebx,%edx)
	jne	.L3
	movzbl	(%ebx,%edx), %ecx
	cmpb	$65, %cl
	je	.L10
	cmpb	$71, %cl
	jne	.L3
	xorl	%ecx, %ecx
	cmpb	$65, 1(%ebx,%edx)
	sete	%cl
	addl	%ecx, %eax
	cmpl	%edx, %esi
	leal	1(%edx), %ecx
	ja	.L11
	.p2align 4,,7
	.p2align 3
.L6:
	popl	%ebx
	popl	%esi
	popl	%ebp
	ret

---------------------------

ldc -O3 -release -inline -output-s test1.d

_D5test14testFAaZi:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$20, %esp
	movl	40(%esp), %esi
	cmpl	$3, %esi
	je	.LBB1_8
	addl	$4294967293, %esi
	xorl	%ebx, %ebx
	movl	%ebx, %edi
	.align	16
.LBB1_2:
	movl	44(%esp), %eax
	leal	(%eax,%ebx), %ebp
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	jne	.LBB1_5
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str1, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	jne	.LBB1_5
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str2, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	je	.LBB1_6
.LBB1_5:
	incl	%edi
.LBB1_6:
	incl	%ebx
	cmpl	%esi, %ebx
	jb	.LBB1_2
.LBB1_7:
	movl	%edi, %eax
	addl	$20, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	ret	$8
.LBB1_8:
	xorl	%edi, %edi
	jmp	.LBB1_7

---------------------------

ldc -O3 -release -inline -output-s test2.d

_D5test24testFAaZi:
	pushl	%ebx
	pushl	%esi
	movl	12(%esp), %ecx
	cmpl	$3, %ecx
	je	.LBB2_14
	movl	16(%esp), %edx
	addl	$4294967293, %ecx
	xorl	%eax, %eax
	movl	%eax, %esi
	.align	16
.LBB2_2:
	movb	(%edx,%esi), %bl
	cmpb	$84, %bl
	jne	.LBB2_12
	movb	1(%edx,%esi), %bh
	cmpb	$65, %bh
	jne	.LBB2_5
	cmpb	$71, 2(%esi,%edx)
	je	.LBB2_11
.LBB2_5:
	cmpb	$84, %bl
	jne	.LBB2_12
	cmpb	$71, %bh
	jne	.LBB2_8
	cmpb	$65, 2(%esi,%edx)
	je	.LBB2_11
.LBB2_8:
	cmpb	$65, %bh
	setne	%bh
	cmpb	$84, %bl
	jne	.LBB2_12
	testb	%bh, %bh
	jne	.LBB2_12
	cmpb	$65, 2(%esi,%edx)
	jne	.LBB2_12
.LBB2_11:
	incl	%eax
.LBB2_12:
	incl	%esi
	cmpl	%ecx, %esi
	jb	.LBB2_2
.LBB2_13:
	popl	%esi
	popl	%ebx
	ret	$8
.LBB2_14:
	xorl	%eax, %eax
	jmp	.LBB2_13

---------------------------

ldc -O3 -release -inline -output-s test3.d

_D5test34testFAaZi:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %ecx
	cmpl	$3, %ecx
	je	.LBB1_12
	movl	20(%esp), %edx
	addl	$4294967293, %ecx
	xorl	%eax, %eax
	movl	%eax, %esi
	.align	16
.LBB1_2:
	cmpb	$84, (%edx,%esi)
	jne	.LBB1_10
	movb	1(%edx,%esi), %bl
	cmpb	$71, %bl
	je	.LBB1_8
	cmpb	$65, %bl
	jne	.LBB1_10
	movb	2(%esi,%edx), %bl
	cmpb	$71, %bl
	jne	.LBB1_7
	incl	%eax
	jmp	.LBB1_10
.LBB1_7:
	cmpb	$65, %bl
	jmp	.LBB1_9
.LBB1_8:
	cmpb	$65, 2(%esi,%edx)
.LBB1_9:
	sete	%bl
	movzbl	%bl, %edi
	addl	%edi, %eax
.LBB1_10:
	incl	%esi
	cmpl	%ecx, %esi
	jb	.LBB1_2
.LBB1_11:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$8
.LBB1_12:
	xorl	%eax, %eax
	jmp	.LBB1_11

---------------------------

Bye,
bearophile
November 27, 2010
how well does dmd perform when you use a switch statement?
November 27, 2010
Ellery Newcomer:

> how well does dmd perform when you use a switch statement?

4th D version:

import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        switch (data[i .. i + 3]) {
            case "TAG", "TGA", "TAA":
                count++;
            default:
        }
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}



Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  D #4: 1.84
  Psy:  1.59
  D #2: 0.55
  D #3: 0.34

Bye,
bearophile
November 27, 2010
On Sat, 27 Nov 2010 11:53:06 -0500
bearophile <bearophileHUGS@lycos.com> wrote:

> Timings, dmd compiler, best of 4, seconds:
>   D #1: 5.72
>   Psy:  1.59
>   D #2: 0.55
>   D #3: 0.34
> 
> 
> 2.3 GHz CPU.
> dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.4.5.
> 
> --------------------------
> 
> # Python2 version
> import psyco
> 
> def test(data):
>     count = 0
>     for i in xrange(len(data) - 3):
>         codon = data[i : i + 3]
>         if codon == "TAG" or codon == "TGA" or codon == "TAA":
>             count += 1
>     return count
> 
> def main():
>     data = open("data.txt").read()
>     data = data * 300
>     print test(data)
> 
> psyco.full()
> main()
> 
> --------------------------
> 
> To produce some test data:
> 
> 
> # Python2 code
> from itertools import islice
> 
> def rnd(ia = 3877, ic = 29573, im = 139968):
>     seed = 42
>     imf = float(im)
>     while 1:
>         seed = (seed * ia + ic) % im
>         r = seed / imf
>         yield "ACGT"[int(r * 4)]
> 
> 
> def main():
>     data = "".join(islice(rnd(), 0, 120000))
>     file("data.txt", "w").write(data)
> 
> main()
> 
> --------------------------
> 
> A D2 traslation (I have used printf to reduce a lot the asm produced):
> 
> 
> // D2 version #1
> import std.file: read;
> import std.c.stdio: printf;
> 
> int test(char[] data) {
>     int count;
>     foreach (i; 0 ..  data.length - 3) {
>         char[] codon = data[i .. i + 3];
>         if (codon == "TAG" || codon == "TGA" || codon == "TAA")
>             count++;
>     }
>     return count;
> }
> 
> void main() {
>     char[] data0 = cast(char[])read("data.txt");
>     int n = 300;
>     char[] data = new char[data0.length * n];
>     for (size_t pos; pos < data.length; pos += data0.length)
>         data[pos .. pos+data0.length] = data0;
> 
>     printf("%d\n", test(data));
> }
> 
> --------------------------
> 
> // D2 version #2
> import std.file: read;
> import std.c.stdio: printf;
> 
> bool cmp3(char[] s1, string s2) {
>     return s1[0] == s2[0] && s1[1] == s2[1] && s1[2] == s2[2];
> }
> 
> int test(char[] data) {
>     int count;
>     foreach (i; 0 ..  data.length - 3) {
>         char[] codon = data[i .. i + 3];
>         if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"))
>             count++;
>     }
>     return count;
> }
> 
> void main() {
>     char[] data0 = cast(char[])read("data.txt");
>     int n = 300;
>     char[] data = new char[data0.length * n];
>     for (size_t pos; pos < data.length; pos += data0.length)
>         data[pos .. pos+data0.length] = data0;
> 
>     printf("%d\n", test(data));
> }

Impressive!
What's your diagnostic on why basic string compare parforms the way it does?
And have you tried a naive hand-written strComp (just for the sake of completeness) like:

    bool strComp (string s1, string s2) {
        auto l = s1.length;
        if (l != s2.length)
            return false;
        foreach (uint i ; 0..l)
            if (s1[i] != s2[i])
                return false;
        return true;
    }

(I bet this will perform at about the same order of magnitude as your version D#1.)
Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?


Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com

November 27, 2010
denis.spir

> What's your diagnostic on why basic string compare parforms the way it does?

I prefer the numbers and asm listings speak for themselves, because my diagnostic sometimes is wrong. But I guess it's caused by three not inlined calls to __adEq2 each loop, done to compare 3 char long arrays.


> And have you tried a naive hand-written strComp (just for the sake of completeness) like:

Here it is:

Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  D #4: 1.84
  D #5: 1.73
  Psy:  1.59
  D #2: 0.55
  D #3: 0.34


// D version #5
import std.file: read;
import std.c.stdio: printf;

bool cmp2(char[] s1, string s2) {
    if (s1.length != s2.length)
        return false;
    foreach (uint i ; 0 .. s1.length)
        if (s1[i] != s2[i])
            return false;
    return true;
}

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (cmp2(codon, "TAG") || cmp2(codon, "TGA") || cmp2(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}


> Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?

I don't know. I think you can't.

Bye,
bearophile
November 27, 2010
On Saturday 27 November 2010 14:46:00 bearophile wrote:
> denis.spir
> > Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?
> 
> I don't know. I think you can't.

You might be able to do that for primitive types and for structs with no opEquals() and with no member variables with an opEquals(), but you certainly can't do it in the general case, since in the general case, you'd have to be calling opEquals() on each element individually.

- Jonathan M Davis
November 28, 2010
On 11/27/2010 11:53 AM, bearophile wrote:
> While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark.
>
> Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-)
>
> The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3), but the situation doesn't change much).
>

It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".

If it is the former, a state machine is probably the right answer for long input strings.

start -> E

E A,C,G:-> E
E T:-> T

T A:-> TA
T C:-> E
T G:-> TG
T T:-> T

TA A,G:-> E, {++count}
TA C:-> E
TA T:-> T

TG A:-> E, {++count}
TG C,G:-> E
TG T:-> T


Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc.

Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26).

=Austin
November 28, 2010
On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
> On 11/27/2010 11:53 AM, bearophile wrote:
> > While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark.
> > 
> > Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-)
> > 
> > The reduced code looks like a syntetic benchmark, but it has about the
> > same performance profile of a 60 lines long Python script (the original
> > code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
> > but the situation doesn't change much).
> 
> It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".


What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point.

- Jonathan M Davis
November 28, 2010
On Saturday 27 November 2010 22:23:39 Jonathan M Davis wrote:
> On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
> > On 11/27/2010 11:53 AM, bearophile wrote:
> > > While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark.
> > > 
> > > Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-)
> > > 
> > > The reduced code looks like a syntetic benchmark, but it has about the
> > > same performance profile of a 60 lines long Python script (the original
> > > code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
> > > but the situation doesn't change much).
> > 
> > It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
> 
> What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point.

Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282
November 28, 2010
Austin Hastings:

> It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".

The third D version in my post is fast enough, so the point of that post is that a very common string compare operation, done in the natural way, is a bit too much slow in D/DMD.


> If it is the former, a state machine is probably the right answer for long input strings.

This case is very simple, so a simple tree (as in the third D version of my post) is probably good enough.

In some similar but more complex situations of genomic processing I've seen that in C it's good to have computed gotos, to implement faster state machines. See my recent post "explicitly unimplemented computed gotos" :-) This was my use case beside the desire to create interpreters in D (it's a similar thing).


> Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc.

All 3 codons start with T, so if you test that first letter before the 3 string compare you speed up the original code a lot. But I was looking for a more general solution. Implementing a state machine is good, but in this case I'd like the compiler to do more optimizations by itself. A simple but good enough optimization for the compiler is to replace tests with strings (known to be short at compile-time) with inlined comparisons of their single chars.


> Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26).

In this case the alphabet was just 4 symbols: {'A', 'C', 'G', 'T'} (with no undecided or modified nucleosides).

Bye,
bearophile
« First   ‹ Prev
1 2