std.string.maketrans and std.string.translate not unicode aware

June 30, 2004

Posted by Sam McCall

The std.string.maketrans and translate functions are meant to create and apply a character translation table, respectively. However, at the moment they create and apply a byte translation table.
This causes translation errors. For example, if you try to replace an ASCII character with a non-ASCII character, the length assertion fails, because the two UTF-8 arrays have different lengths in bytes.
Unfortunately, it's not possible to fix this without changing the function signatures: a lookup table indexed by every possible character would be too big. Something like this should work...
Sam

/************************************
 * Construct translation table for translate().
 */

dchar[dchar] maketrans(dchar[] from, dchar[] to)
    in
    {
	assert(from.length == to.length);
    }
    body
    {
	dchar[dchar] t;

	for (int i=0; i<from.length; i++)
	    t[from[i]] = to[i];

	return t;
    }

/******************************************
 * Translate characters in s[] using table created by maketrans().
 * Delete chars in delchars[].
 */

dchar[] translate(dchar[] s, dchar[dchar] transtab, dchar[] delchars) {
	dchar[] r;
	int i;
	int count;
	bit[dchar] deltab;

	for (i = 0; i < delchars.length; i++)
	    deltab[delchars[i]] = true;

	count = 0;
	foreach(dchar d; s)
		if(!(d in deltab))
			count++;

	r = new dchar[count];
	count = 0;
	foreach(dchar d; s)
		if(!(d in deltab))
			r[count++] = (d in transtab) ? transtab[d] : d;	// unmapped characters pass through unchanged

	return r;
}


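/******************************************
 * UTF-8 overload: translate characters in s[] using table created by
 * maketrans(). Delete chars in delchars[].
 * Decodes s[] to dchars, then re-encodes the result with toUTF8()
 * (needs import std.utf).
 */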

char[] translate(char[] s, dchar[dchar] transtab, dchar[] delchars) {
	dchar[] r;
	int i;
	int count;
	bit[dchar] deltab;

	for (i = 0; i < delchars.length; i++)
	    deltab[delchars[i]] = true;

	count = 0;
	foreach(dchar d; s)	// iterates properly over characters
		if(!(d in deltab))
			count++;

	r = new dchar[count];
	count = 0;
	foreach(dchar d; s)
		if(!(d in deltab))
			r[count++] = (d in transtab) ? transtab[d] : d;	// unmapped characters pass through unchanged

	return toUTF8(r);
}
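
To make the length problem concrete, here is a minimal sketch, assuming a current D2 compiler rather than the 2004 one the code above targets. The helper names sameByteLength and sameCharLength are made up for illustration and are not part of Phobos; the only library call is std.utf.toUTF32.

```d
import std.utf : toUTF32;

/// Compares lengths in UTF-8 code units (bytes). This is the comparison
/// a byte-indexed translation table effectively makes.
bool sameByteLength(const(char)[] from, const(char)[] to)
{
    return from.length == to.length;
}

/// Compares lengths in characters (code points). This is the comparison
/// a dchar-keyed translation table makes.
bool sameCharLength(const(char)[] from, const(char)[] to)
{
    return toUTF32(from).length == toUTF32(to).length;
}
```

With from = "e" and to = "\u00E9" (e with an acute accent), the byte lengths are 1 and 2, so the first check fails, while both strings are exactly one character long, so the second succeeds.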

In article <cbttmg$1u68$1@digitaldaemon.com>, Sam McCall says...
>
>The std.string.maketrans and translate functions are meant to create and apply a character translation table, respectively. However at the moment they create and apply a byte translation table.

I agree with Sam here, but I should point out that the bug is actually much more serious than Sam suggests. It's not just a matter of missing features; it's a matter of serious UTF-8 corruption.

The current implementation allows users to modify char values in the range 0x80 to 0xFF. These bytes have specific meaning in UTF-8: in valid text they only ever appear as part of a multi-byte sequence. Allowing users to modify such values with a translate() routine is DANGEROUS, and is pretty much guaranteed to result in a string containing invalid UTF-8.

Sam suggests a number of ways of making these functions dchar-based instead of char-based. But if you want to keep them char-based, then you absolutely must disallow the modification of any char >0x7F, and document such functions as ASCII-only.

Arcane Jill
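
The ASCII-only option can be sketched roughly as follows, assuming a current D2 compiler. Both isValidUtf8 and translateAscii are hypothetical names invented for this illustration, not existing Phobos functions; std.utf.validate and UTFException are the only library pieces used.

```d
import std.utf : validate, UTFException;

/// Returns true if s is well-formed UTF-8 (hypothetical helper).
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Hypothetical ASCII-only translate: the table has 128 entries, and every
/// byte >= 0x80 is copied through untouched, so multi-byte sequences in s
/// are never altered.
char[] translateAscii(const(char)[] s, const(char)[] from, const(char)[] to)
{
    assert(from.length == to.length);

    // Start from the identity mapping for the ASCII range.
    char[128] table;
    foreach (i, ref c; table)
        c = cast(char) i;

    // Only ASCII-to-ASCII replacements are accepted.
    foreach (i, f; from)
    {
        assert(f < 0x80 && to[i] < 0x80, "ASCII only");
        table[f] = to[i];
    }

    char[] r = s.dup;
    foreach (ref c; r)
        if (c < 0x80)
            c = table[c];
    return r;
}
```

Because each ASCII byte is replaced by exactly one ASCII byte and everything else is left alone, valid UTF-8 input stays valid. By contrast, overwriting a single byte of a two-byte sequence (for example the 0xA9 in the encoding of U+00E9) leaves a lead byte with no continuation byte, which validate() rejects.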
