Andrej Mitrovic
Posted in reply to bearophile
| On 6/21/11, bearophile <bearophileHUGS@lycos.com> wrote:
> After translating so much code have you seen some recurring patterns in your code conversion bugs?
A lot of C code is missing default labels. I've used the github 2.054 clone to compile all examples in order to catch missing labels at compile time. I've missed many of them before this feature became available.
Float/Doubles are initialized to NaN in D, however I don't know what they're initialized to in C. But it seems none of the examples had floating point variables initialized. This causes halts/exceptions in D, and without ddbg it's really hard to track down what went wrong. I don't think the stack traces helped me in this regard. Generally forgetting to initialize floats seems to be common in other C code as well.
C struct initializer syntax is used for static arrays of structs in the C code. I think this could easily be a source of bugs if you add or remove fields from a struct. I'm not sure if that syntax works in D but I've used explicit ctor calls instead, such as:
struct Foo { int x, y; }
Foo[5] foo = [Foo(1, 2), Foo(1, 2), Foo(1, 2), Foo(1, 2), Foo(1, 2)];
It's verbose but it's safer.
In some code a variable would be defined as an int, and returned as a short. There are other cases where it's defined as an int, and then casted to a ubyte (BYTE is an alias for ubyte there). I have to fix some of these.
I did notice an annoying thing with D, for example:
ubyte val = min(0, 128);
Error: cannot implicitly convert expression (min(0,128)) of type int to ubyte
The problem with this is I have to add a cast or a safer to!() call.
to!() can only do a runtime check:
ubyte val = to!ubyte(min(1024, 2048)); // overflow exception at runtime
And a cast is completely unsafe:
ubyte val = cast(ubyte)(min(1024, 2048)); // no error, bugs introduced!
Of course the compiler doesn't know what min does, only that it returns an int, so I don't know whether it could ever figure out that a function would never return a value that overflows the assignment variable (maybe this feature could be implemented for pure functions..).
C Globals become an instant source of bugs as soon as you fire up a
new thread. This is due to D's thread-local storage by default
feature. I like D's approach, and C globals should be replaced with
messaging or some other technique. I've used __gshared in various
modules to make the examples compile and work properly, but this
should be replaced.
This isn't unique to these examples. I've seen globals used everywhere
for messaging purposes in C code, e.g. some audio processing examples
in a different library.
A weird thing I've seen in the examples was a pointer to a static variable was passed to a newly spawned thread's function. And via this static variable the original thread and the work thread communicated together. How this can work is beyond me, but it can't work in D. Luckily D's spawn() function does offer protection for this kind of code.
These examples used implicit fallbacks a lot, with only a comment
explicitly mentioning that there is a fallback in play. I think it's
good that we've abandoned implicit fallbacks, they could easily cause
bugs.
But, on the other hand someone might get an error about an implicit
fallback and think they forgot to add a "break" when porting C code.
So they add that, but now they've introduced a bug because the
fallback was actually necessary. I had this happen to me in one
example because the fallback wasn't explicitly mentioned.
There are issues with calling format() and writef(). This is not just
for this translation project. Most C code uses some form of
printf/wsprintf to either store a string to a variable or print it
out. But the format specifiers often do not translate into D. I think
what is probably causing this is that C types are not the same as D
types, so when you see this in C code:
printf("Note Count %ld, (long) notescount);
This might not translate into D and you could easily get an exception at runtime (a pity that this doesn't get caught at compile-time). The root of the problem is that the format specifiers explicitly name the type, while you inadvertently change the actual type in D (maybe you were fixing a conversion bug), and then you randomly get an exception thrown at runtime.
The #1 problem with these exceptions is that you get no error line
whatsoever. You just get back "Invalid format specifier" or something
like that. This is of no help, and I always have to fire up ddbg or
manually hunt down each format specifier and replace it with D's safe
"%s", which works great.
The exceptions with no error line problem is common for most Phobos
functions, I absolutely hate getting back an error message such as:
std.exception.ErrnoException@std\stdio.d(286): Cannot open file
`blablabla' in mode `r' (No such file or directory)
No stack trace, no error line except the error line in Phobos. I *don't care* about error lines in Phobos, the error is in my code not in Phobos, and I really think this needs to improve.
Many functions assume ASCII or USC2, both of which are wrong to assume
these days. I have some examples I still have to fix, because I've
used code like:
string buffer;
TextOut(buffer.toUTF16z, buffer.length);
This is incorrect code. TextOut takes the number of characters, not
code units when issuing the call. The code will compile and run fine
for the majority of cases because most characters will fit into
wchar's code point, and you will have a 1to1 code-point to code-unit
ratio. But add a Japanese character and you will see garbage being
printed out to the screen. The correct workaround for this is to
either:
1) Store everything as a dchar[]:
dstring buffer;
TextOut(buffer.toUTF16z, buffer.length); // correct as long as
UTF never has more than 4 code units
2) Use some Phobos code-point counting function
string buffer;
TextOut(buffer.toUTF16z, std.utf.count(buffer));
The second option might cause performance issues, while using UTF32 dchars might cause excessive use of memory. Other solutions might be to wrap strings and cache the count of code points. It's a balancing act between memory and performance.
Some API functions take char* as a parameter when expecting some pointer to a data buffer which can but doesn't necessarily store text. I don't know why char* is used instead of void*, perhaps char* are more often used with these functions so they've decided to type it like that.
There are many cases of implicit 0 to void* casts, but D catches this at compile time.
A big issue with porting is that D's character arrays are always initialized to invalid values. This wreaks havoc in code, in particular with static arrays. For example:
char[100] buffer;
APIWriteToBuffer(buffer.ptr);
myDStringFunction(buffer[]); // <- buggy code
The problem is twofold. First, the API call might not work because it
could easily expect a null terminator to be present as a first
character:
char[100] buffer;
APIWriteToBuffer(buffer.ptr); // Might not work!
So I've had to do this:
char[100] buffer;
buffer[] = '\0'; // or buffer[0] = '\0', depending on the function
APIWriteToBuffer(buffer.ptr);
Second, after the call you will have a buffer that has some valid text stored into it, and then either null terminators to the length of the array (if you did buffer[] = '\0'), OR you will have a null terminator followed by a series of char.init invalid UTF values.
So when you finally call a D function with such a static array, it
could print out garbage or throw an exception:
char[100] buffer;
APIWriteToBuffer(buffer.ptr);
myDStringFunction(buffer[]); // what will this function do when
stumbles on '\0' or char.init?
Behind the scense that D function might in turn call a C function, so
continued from the last example:
char[100] buffer;
APIWriteToBuffer(buffer.ptr);
myDStringFunction(buffer[]);
void myDStringFunction(char[] buffer)
{
// huge mess, the API function might write nonsense like ['f',
'o', 'o', '\0', char.init, char.init..]
APIPrintToScreen(buffer.ptr, buffer.length);
}
So you have to do conversion and duplication all the time. You have to do:
char[100] buffer;
APIWriteToBuffer(buffer.ptr); // buffer might now be ['f', 'o',
'o', '\0', char.init, char.init..];
myDStringFunction(to!string(buffer.ptr.toStringz));
How it works: First, you need to treat the buffer as a C null-terminated char array, hence using .ptr. Then, you have to call a Phobos function so it can retrieve a string until the first null terminator.
But it gets worse when UTF is in play because in v2.043 conv.to doesn't know how to convert wchar*'s. I've made a request on github and I think this feature will be incorporated, but for now you can't do this:
wchar[100] buffer;
APIWriteToBuffer(buffer.ptr);
myDStringFunction(to!string(buffer.ptr)); // error, to!() can't
convert wchar*
So I've had to create my own function which converts a wchar* to wchar[], which can then be used for whatever purpose:
wchar[100] buffer;
APIWriteToBuffer(buffer.ptr);
myDStringFunction(to!string(fromWStringz(buffer.ptr)));
So generally static arrays of characters and string handling between C and D can be painful.
|