Empty string vs null

I have just discovered that D seems to treat empty and null strings as the same thing:

// test.d
import std.stdio;
import std.string;
void main()
{
    string x = null;
    writeln("x     = \"", x, "\"");
    writeln("null  = ", x == null);
    writeln("\"\"    = ", x == "");
    writeln("empty = ", x.empty);
    x = "";
    writeln("\nx     = \"", x, "\"");
    writeln("null  = ", x == null);
    writeln("\"\"    = ", x == "");
    writeln("empty = ", x.empty);
    x = "x";
    writeln("\nx     = \"", x, "\"");
    writeln("null  = ", x == null);
    writeln("\"\"    = ", x == "");
    writeln("empty = ", x.empty);
}

Output:

x     = ""
null  = true
""    = true
empty = true

x     = ""
null  = true
""    = true
empty = true

x     = "x"
null  = false
""    = false
empty = false

1. Why is this?
2. Should I prefer null or ""? I was hoping to return null to indicate "no string that matches the criteria", and "some string" otherwise.

February 04, 2020

Re: Empty string vs null

Posted by Jonathan M Davis
in reply to mark

On Tuesday, February 4, 2020 12:33:42 AM MST mark via Digitalmars-d-learn wrote:
> I have just discovered that D seems to treat empty and null strings as the same thing:
>
> // test.d
> import std.stdio;
> import std.string;
> void main()
> {
>      string x = null;
>      writeln("x     = \"", x, "\"");
>      writeln("null  = ", x == null);
>      writeln("\"\"    = ", x == "");
>      writeln("empty = ", x.empty);
>      x = "";
>      writeln("\nx     = \"", x, "\"");
>      writeln("null  = ", x == null);
>      writeln("\"\"    = ", x == "");
>      writeln("empty = ", x.empty);
>      x = "x";
>      writeln("\nx     = \"", x, "\"");
>      writeln("null  = ", x == null);
>      writeln("\"\"    = ", x == "");
>      writeln("empty = ", x.empty);
> }
>
> Output:
>
> x     = ""
> null  = true
> ""    = true
> empty = true
>
> x     = ""
> null  = true
> ""    = true
> empty = true
>
> x     = "x"
> null  = false
> ""    = false
> empty = false
>
> 1. Why is this?

It's a side effect of how dynamic arrays in D are structured. They're basically

struct DynamicArray(T)
{
    size_t length;
    T* ptr;
}

A null array has a length of 0 and a ptr which is null. So, if you check length, you get 0. empty checks whether length is 0. So, if you check whether an array is empty, and it happens to be null, then the result is true.

Similarly, the code which checks for equality is going to check for length first. After all, if the lengths don't match, there's no point in comparing the elements in the array. And if the length is 0, then even if the lengths match, there's no point in checking the value of ptr, because the array has no elements. So, whether the array is empty because it's null or whether it's because its length got reduced to 0 is irrelevant.
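To make that concrete, here is a minimal sketch that looks at the two fields directly. The describe helper is a hypothetical name made up for this illustration, not something from Phobos:

```d
import std.format : format;
import std.stdio : writeln;

// Hypothetical helper: reports the two fields that make up a dynamic array.
string describe(T)(T[] arr)
{
    return format("length=%s ptrIsNull=%s", arr.length, arr.ptr is null);
}

void main()
{
    string a = null;   // length 0, ptr null
    string b = "";     // length 0, but ptr points at the literal's storage
    int[] c = [1, 2, 3];
    c = c[0 .. 0];     // length reduced to 0, ptr still points at the old data

    writeln(describe(a));
    writeln(describe(b));
    writeln(describe(c));

    // == only looks at length (and elements, if there are any),
    // so all three compare equal to null and to each other.
    assert(a == b);
    assert(a == null && b == null && c == null);
}
```

All three arrays are empty as far as == and empty are concerned; only the ptr field tells them apart.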

The natural result of all of this is that D treats null arrays and empty arrays as almost the same thing. They're treated differently if you use the is operator, because that checks that the two values are the same bitwise. For instance, in the case of pointers or class references, it checks their pointer values, not what they point to. And in the case of dynamic arrays, it's comparing both the length and ptr values. So, if you want to check whether a dynamic array is really null, then you need to use the is operator instead of ==. e.g.

writeln(arr is null);

instead of

writeln(arr == null);

As a side note, when using an array directly in the condition of an if statement or assertion, it's equivalent to checking whether it's _not_ null. So,

if(arr) {...}

is equivalent to

if(arr !is null) {...}

Because of how a null array is an empty array, some people expect the array to be checked for whether it's non-empty in those situations, which can cause confusion.
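A small sketch of the three checks side by side may help here. The helper names (isReallyNull, equalsNull, truthy) are hypothetical and exist only for this example:

```d
import std.stdio : writeln;

// Hypothetical helpers wrapping each kind of check.
bool isReallyNull(const(char)[] s) { return s is null; }  // compares ptr and length
bool equalsNull(const(char)[] s)   { return s == null; }  // compares contents only

bool truthy(const(char)[] s)
{
    // Using the array directly as a condition behaves like `s !is null`,
    // NOT like `s.length != 0`.
    if (s)
        return true;
    return false;
}

void main()
{
    assert(isReallyNull(null));
    assert(!isReallyNull(""));   // "" has a non-null ptr

    assert(equalsNull(null));
    assert(equalsNull(""));      // empty, so equal to null

    assert(!truthy(null));
    assert(truthy(""));          // surprising if you expected an emptiness check
    assert(truthy("x"));

    writeln("all checks passed");
}
```

If what you actually care about is emptiness, checking arr.length != 0 (or !arr.empty) says so explicitly and sidesteps the confusion.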

> 2. Should I prefer null or ""? I was hoping to return null to indicate "no string that matches the criteria", and "some string" otherwise.

In most situations, it really doesn't matter whether you use null or "" except that "" is automatically a string, whereas null can be used as a literal for any type of dynamic array (in fact, typeof(null) is its own type in order to deal with that in generic code). The reason that

"" is null

is false is because all string literals in D have a null character one past their end. This is so that you can pass them directly to C functions without having to explicitly add the null character. e.g. both

printf("hello world");

and

printf("");

work correctly, because the compiler implicitly uses the ptr member of the strings, and the C code happily reads past the end of the array to the null character, whereas ""[0] would throw a RangeError in D code. Strings that aren't literals don't have the null character unless you explicitly put it there, and they require that you use ptr explicitly when calling C functions, but for better or worse, string literals don't force that on you.
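As a rough illustration of the difference between literals and other strings when calling C, here is a sketch using strlen from core.stdc.string and toStringz from std.string. The cLength wrapper is a hypothetical name for this example:

```d
import core.stdc.string : strlen;
import std.conv : text;
import std.string : toStringz;

// Hypothetical wrapper: gets a C-style length for any D string by making
// sure there is a terminating null character before handing it to C.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

void main()
{
    // String literals convert implicitly to const(char)* and are
    // followed by a null character, so C functions can read them.
    assert(strlen("hello world") == 11);
    assert(strlen("") == 0);

    // A string built at run time has no such guarantee.
    string built = text("id-", 7);   // "id-7"

    // strlen(built) does not compile: there is no implicit conversion.
    // strlen(built.ptr) compiles, but may read past the end of the array.
    assert(cLength(built) == 4);
}
```

toStringz may allocate a copy in order to append the terminator, which is the price for not relying on whatever happens to sit after the array in memory.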

There are definitely experienced D programmers who differentiate between null and empty arrays / strings in their code (in fact, that's why if(arr) ultimately wasn't deprecated even though a number of people were pushing for it because of how it confuses many people). However, there are also plenty of D programmers who would argue that you should never treat null as special with arrays because of how null arrays are empty instead of being treated as their own thing.

Personally, I would say that if you want to differentiate between null and empty, it can be done, but you need to be careful - especially if this is going to be a function in a public API rather than something local to your code. It's really easy to end up with a null array when you didn't expect to - especially if your function is calling other functions that return arrays.

So, if you had a function that returned null when it fails, that _can_ work, but you would either have to make sure that success never resulted in an empty array being returned, or you would have to make it clear in the documentation that the is operator must be used to check for null rather than == and ensure that even if an empty array is returned, it will never be null. It can work, but ultimately, for public APIs, it's arguably better to use Nullable from std.typecons to differentiate. It has the downside that the return type is larger, but it's less error-prone. For code that isn't part of a public API (especially code that only you work on), it's less risky to explicitly return null rather than using Nullable, but it's still a risk - especially if the code gets changed over time.
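For what it's worth, a Nullable-based version of the "no string matches the criteria" case could look something like the sketch below. The function firstWithPrefix and its matching rule are made up for the example; only Nullable, isNull, get, and startsWith come from Phobos:

```d
import std.algorithm.searching : startsWith;
import std.stdio : writeln;
import std.typecons : Nullable;

// Hypothetical search function: returns the first word that starts with
// prefix, or a Nullable with no value if nothing matches.
Nullable!string firstWithPrefix(string[] words, string prefix)
{
    foreach (w; words)
    {
        if (w.startsWith(prefix))
            return Nullable!string(w);
    }
    return Nullable!string.init;   // "no match", independent of null vs ""
}

void main()
{
    auto hit = firstWithPrefix(["apple", "banana"], "ban");
    assert(!hit.isNull && hit.get == "banana");

    // An empty string can be a legitimate match and is still
    // distinguishable from "no match".
    auto emptyHit = firstWithPrefix(["", "apple"], "");
    assert(!emptyHit.isNull && emptyHit.get == "");

    auto miss = firstWithPrefix(["apple"], "z");
    assert(miss.isNull);

    writeln("found: ", hit.get);
}
```

The caller has to go through isNull before calling get, so there is no way to confuse an empty result with a missing one by accidentally using == instead of is.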

- Jonathan M Davis
