Jump to page: 1 2
Thread overview
const and phobos string functions
Aug 01, 2007
Regan Heath
Aug 01, 2007
Kirk McDonald
Aug 02, 2007
Regan Heath
Aug 02, 2007
Kirk McDonald
Aug 02, 2007
Regan Heath
Aug 02, 2007
Reiner Pope
Aug 02, 2007
Regan Heath
Aug 02, 2007
Reiner Pope
Aug 02, 2007
Regan Heath
Aug 02, 2007
Daniel Keep
Aug 02, 2007
Daniel Keep
August 01, 2007
I wanted to post this real/concrete case of where 'const' and the current phobos implementation of certain string routines combine to cause a few irritations, inefficiencies and some IMO illogical behaviour.

Note in the code below 'remove' is a template function which uses 'memmove' to remove an item from the array.

(The code):

char[][] tmp;

tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
foreach(ref line; tmp) line = cast(char[])strip(line);
for(int i = 0; i < tmp.length; i++)
{
  if (tmp[i].length == 0)
    tmp.remove(i);
}

(The problem):

Notice all the cast()'s required above.  I am of the belief that casting should only be done in exceptional circumstances, and this shouldn't be one of them.

If you consider char[] as being a a string which you are the full owner of, you can change it at will, and 'string' as being a string which you are not the full owner of, as you can only read it then in many cases it doesn't make sense to pass a char[] to a function and get a 'string' back, unless the function has 'taken ownership' of the string data or is returning different string data which you are not the owner of.

The obvious case where it make sense is a property setter which might copy the input and return a 'string' reference to the copy.

In the cases above (splitlines, strip) it doesn't make sense for the function to 'take ownership' of the strings as they do not store them beyond the lifetime of the function call.

Therefore if passed a char[] they should return char[][] and char[] respectively.

Of course if passed 'string' they should return 'string[]' and 'string' respectively.

(The solution):

Given the requirements it seems templating is the obvious solution.  I have described in earlier posts (though these relate to tolower specifically):

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55335
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55337

All the phobos routines should likely be templated in the same fashion.

Regan
August 01, 2007
Regan Heath wrote:
> I wanted to post this real/concrete case of where 'const' and the current phobos implementation of certain string routines combine to cause a few irritations, inefficiencies and some IMO illogical behaviour.
> 
> Note in the code below 'remove' is a template function which uses 'memmove' to remove an item from the array.
> 
> (The code):
> 
> char[][] tmp;
> 
> tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
> foreach(ref line; tmp) line = cast(char[])strip(line);
> for(int i = 0; i < tmp.length; i++)
> {
>   if (tmp[i].length == 0)
>     tmp.remove(i);
> }
> 
> (The problem):
> 
> Notice all the cast()'s required above.  I am of the belief that casting should only be done in exceptional circumstances, and this shouldn't be one of them.
> 
> If you consider char[] as being a a string which you are the full owner of, you can change it at will, and 'string' as being a string which you are not the full owner of, as you can only read it then in many cases it doesn't make sense to pass a char[] to a function and get a 'string' back, unless the function has 'taken ownership' of the string data or is returning different string data which you are not the owner of.
> 
> The obvious case where it make sense is a property setter which might copy the input and return a 'string' reference to the copy.
> 
> In the cases above (splitlines, strip) it doesn't make sense for the function to 'take ownership' of the strings as they do not store them beyond the lifetime of the function call.
> 
> Therefore if passed a char[] they should return char[][] and char[] respectively.
> 
> Of course if passed 'string' they should return 'string[]' and 'string' respectively.
> 
> (The solution):
> 
> Given the requirements it seems templating is the obvious solution.  I have described in earlier posts (though these relate to tolower specifically):
> 
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55335 
> 
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55337 
> 
> 
> All the phobos routines should likely be templated in the same fashion.
> 
> Regan

A lot of those casts go away if tmp is a string[].

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
August 02, 2007
Kirk McDonald wrote:
> Regan Heath wrote:
>> Note in the code below 'remove' is a template function which uses 'memmove' to remove an item from the array.
>>
>> (The code):
>>
>> char[][] tmp;
>>
>> tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
>> foreach(ref line; tmp) line = cast(char[])strip(line);
>> for(int i = 0; i < tmp.length; i++)
>> {
>>   if (tmp[i].length == 0)
>>     tmp.remove(i);
>> }
> 
> A lot of those casts go away if tmp is a string[].

True.  I knew someone would realise/spot that.  ;)  I don't think it solves the basic problem however, which is why I didn't mention it in me original post, perhaps I should have.  Let me explain further...

Imagine calling splitlines on char[] getting string[] then calling tolower on each line.

Because you have a string[], tolower always has to dup any line it wants to change.

If you template tolower and splitlines so that they accept const and non-const input and return the same then in the case where it gets non-const input tolower can avoid duplication (AKA *performance gain*)

The same is true for any function (phobos or otherwise) which will (or even might) modify the input data.

In the case of splitlines or strip they don't need to modify the input data so there is no performance gain by templating them and handling string and char[] differently.

However, what you do get is a function which no longer 'takes ownership' of the array passed and can then be used in a sequence of modifications without having to cast or dup to 'take ownership back' by brute force (the problem my code illustrates)

To think of it in the abstract sense, by passing a char[] you're passing ownership of the data to the function, when it returns it is passing ownership back again in the form char[] or char[][].

When you pass string you're not passing ownership and the function has no other option but to duplicate the input when it wants to modify it.

In the current case of tolower it returns string and therefore retains ownership of the string, despite the fact that it promptly forgets it ever existed.  Not much we can do about this case because we cannot decide what to return from a function at runtime.

Regan
August 02, 2007
Regan Heath wrote:
> Kirk McDonald wrote:
>> Regan Heath wrote:
>>> Note in the code below 'remove' is a template function which uses 'memmove' to remove an item from the array.
>>>
>>> (The code):
>>>
>>> char[][] tmp;
>>>
>>> tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
>>> foreach(ref line; tmp) line = cast(char[])strip(line);
>>> for(int i = 0; i < tmp.length; i++)
>>> {
>>>   if (tmp[i].length == 0)
>>>     tmp.remove(i);
>>> }
>>
>> A lot of those casts go away if tmp is a string[].
> 
> True.  I knew someone would realise/spot that.  ;)  I don't think it solves the basic problem however, which is why I didn't mention it in me original post, perhaps I should have.  Let me explain further...
> 
> Imagine calling splitlines on char[] getting string[] then calling tolower on each line.
> 
> Because you have a string[], tolower always has to dup any line it wants to change.
> 
> If you template tolower and splitlines so that they accept const and non-const input and return the same then in the case where it gets non-const input tolower can avoid duplication (AKA *performance gain*)
> 
> The same is true for any function (phobos or otherwise) which will (or even might) modify the input data.
> 
> In the case of splitlines or strip they don't need to modify the input data so there is no performance gain by templating them and handling string and char[] differently.
> 
> However, what you do get is a function which no longer 'takes ownership' of the array passed and can then be used in a sequence of modifications without having to cast or dup to 'take ownership back' by brute force (the problem my code illustrates)
> 
> To think of it in the abstract sense, by passing a char[] you're passing ownership of the data to the function, when it returns it is passing ownership back again in the form char[] or char[][].
> 
> When you pass string you're not passing ownership and the function has no other option but to duplicate the input when it wants to modify it.
> 
> In the current case of tolower it returns string and therefore retains ownership of the string, despite the fact that it promptly forgets it ever existed.  Not much we can do about this case because we cannot decide what to return from a function at runtime.
> 
> Regan

In this particular case, you could call this mutating form of tolower() on the buffer returned from read(), and then split it afterwards (perhaps yielding strings). This makes a certain degree of logical sense: If you'd wanted to keep the original contents of the buffer, you'd have to duplicate it at some point anyway. Since you don't, you can alter it immediately.

I submit the following Python idiom: Functions which mutate their arguments should return nothing. That is:

// Return new string
string tolower(string);
// Mutate argument
void tolower(char[]);

This rather strictly highlights the difference between char[] and string, and makes it essentially impossible to mix up library functions differentiated in this way.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
August 02, 2007
Kirk McDonald wrote:
> In this particular case, you could call this mutating form of tolower() on the buffer returned from read(), and then split it afterwards (perhaps yielding strings). This makes a certain degree of logical sense: If you'd wanted to keep the original contents of the buffer, you'd have to duplicate it at some point anyway. Since you don't, you can alter it immediately.

Sure, that solves this particular case.  But, do you agree there is generally a problem, or shall I continue to dream up example cases for you to solve in some other manner :)

> I submit the following Python idiom: Functions which mutate their arguments should return nothing. That is:
> 
> // Return new string
> string tolower(string);
> // Mutate argument
> void tolower(char[]);
> 
> This rather strictly highlights the difference between char[] and string, and makes it essentially impossible to mix up library functions differentiated in this way.

That's all well and good until you want to write:

char[] s;
...
foo(tolower(s));

If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem.

Regan
August 02, 2007
Regan Heath wrote:
> Kirk McDonald wrote:
>> In this particular case, you could call this mutating form of tolower() on the buffer returned from read(), and then split it afterwards (perhaps yielding strings). This makes a certain degree of logical sense: If you'd wanted to keep the original contents of the buffer, you'd have to duplicate it at some point anyway. Since you don't, you can alter it immediately.
> 
> Sure, that solves this particular case.  But, do you agree there is generally a problem, or shall I continue to dream up example cases for you to solve in some other manner :)
> 
>> I submit the following Python idiom: Functions which mutate their arguments should return nothing. That is:
>>
>> // Return new string
>> string tolower(string);
>> // Mutate argument
>> void tolower(char[]);
>>
>> This rather strictly highlights the difference between char[] and string, and makes it essentially impossible to mix up library functions differentiated in this way.
> 
> That's all well and good until you want to write:
> 
> char[] s;
> ...
> foo(tolower(s));
> 
> If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem.
> 
> Regan

I think the code you gave here would be misleading. For instance, suppose you had your templated overloads, effectively giving you

   // copy-on-write
   string tolower(string);
   // in-place
   char[] tolower(char[]);

Then you would get surprising results if you did this:

char[] s = "Hello World!".dup;
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

I think avoiding this is exactly why Kirk suggested his overloads.

  -- Reiner
August 02, 2007
Reiner Pope wrote:
> Regan Heath wrote:
>> Kirk McDonald wrote:
>>> I submit the following Python idiom: Functions which mutate their arguments should return nothing. That is:
>>>
>>> // Return new string
>>> string tolower(string);
>>> // Mutate argument
>>> void tolower(char[]);
>>>
>>> This rather strictly highlights the difference between char[] and string, and makes it essentially impossible to mix up library functions differentiated in this way.
>>
>> That's all well and good until you want to write:
>>
>> char[] s;
>> ...
>> foo(tolower(s));
>>
>> If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem.
>>
>> Regan
> 
> I think the code you gave here would be misleading. For instance, suppose you had your templated overloads, effectively giving you
> 
>    // copy-on-write
>    string tolower(string);
>    // in-place
>    char[] tolower(char[]);
> 
> Then you would get surprising results if you did this:
> 
> char[] s = "Hello World!".dup;
> int a = howManyLettersDiffer(tolower(s), s);
> assert (a == 2); // assert failed, a is actually 0
> 
> I think avoiding this is exactly why Kirk suggested his overloads.

Good point.

Using Kirk's Python'esque function signatures you'd have:

char[] s = "Hello World!".dup;
tolower(s);
int a = howManyLettersDiffer(s, s);
assert (a == 2); // assert failed, a is actually 0

which is immediately and obviously a pointless operation, requring a re-code to:

char[] s = "Hello World!".dup;
char[] s2 = s.dup
tolower(s2);
int a = howManyLettersDiffer(s2, s);
assert (a == 2); // assert failed, a is actually 0

And the 'string' case:

string s = "Hello World!";
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

which would already behave as desired.

It seems that tolower is always going to have to 'copy on write' and return that copy.

But, that doesn't stop you templating it so that it returns the copy as char[] when you pass char[] (which is all I really wanted in the first place) :)

Unfortunately as we cannot overload on return type it would prevent an inplace tolower in the form (given by Kirk):

void tolower(char[])

Instead perhaps we need a naming convention for inplace modifying functions, options:

void <func>Inplace(<args>)
void <func>IP(<args>)       // "IP" stands for inplace
void <func>M(<args>)        // "M" stands for mutates

or something like those.

Regan
August 02, 2007
Regan Heath wrote:
> Reiner Pope wrote:
>> Regan Heath wrote:
>>> Kirk McDonald wrote:
>>>> I submit the following Python idiom: Functions which mutate their arguments should return nothing. That is:
>>>>
>>>> // Return new string
>>>> string tolower(string);
>>>> // Mutate argument
>>>> void tolower(char[]);
>>>>
>>>> This rather strictly highlights the difference between char[] and string, and makes it essentially impossible to mix up library functions differentiated in this way.
>>>
>>> That's all well and good until you want to write:
>>>
>>> char[] s;
>>> ...
>>> foo(tolower(s));
>>>
>>> If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem.
>>>
>>> Regan
>>
>> I think the code you gave here would be misleading. For instance, suppose you had your templated overloads, effectively giving you
>>
>>    // copy-on-write
>>    string tolower(string);
>>    // in-place
>>    char[] tolower(char[]);
>>
>> Then you would get surprising results if you did this:
>>
>> char[] s = "Hello World!".dup;
>> int a = howManyLettersDiffer(tolower(s), s);
>> assert (a == 2); // assert failed, a is actually 0
>>
>> I think avoiding this is exactly why Kirk suggested his overloads.
> 
> Good point.
> 
> Using Kirk's Python'esque function signatures you'd have:
> 
> char[] s = "Hello World!".dup;
> tolower(s);
> int a = howManyLettersDiffer(s, s);
> assert (a == 2); // assert failed, a is actually 0
> 
> which is immediately and obviously a pointless operation, requring a re-code to:
> 
> char[] s = "Hello World!".dup;
> char[] s2 = s.dup
> tolower(s2);
> int a = howManyLettersDiffer(s2, s);
> assert (a == 2); // assert failed, a is actually 0
> 
> And the 'string' case:
> 
> string s = "Hello World!";
> int a = howManyLettersDiffer(tolower(s), s);
> assert (a == 2); // assert failed, a is actually 0
> 
> which would already behave as desired.
> 
> It seems that tolower is always going to have to 'copy on write' and return that copy.
> 
> But, that doesn't stop you templating it so that it returns the copy as char[] when you pass char[] (which is all I really wanted in the first place) :)
> 
> Unfortunately as we cannot overload on return type it would prevent an inplace tolower in the form (given by Kirk):
> 
> void tolower(char[])
> 
> Instead perhaps we need a naming convention for inplace modifying functions, options:
> 
> void <func>Inplace(<args>)
> void <func>IP(<args>)       // "IP" stands for inplace
> void <func>M(<args>)        // "M" stands for mutates
> 
> or something like those.
> 
> Regan

A lot of functions just to do the string operation efficiently, though, isn't it? It's easy to see in cases like this why Walter cites the Pareto principle, saying mostly this isn't the bottleneck.

But still, isn't D supposed to give you fast code _easily_ ?

I must say, I *like* the simplicity of overloading with a mutable and a readonly version, although we have established that it can certainly lead to some confusion. Mind you, D arrays ignore that potential confusion (in the .sort and .reverse properties).

I also had some fancy ideas for a CoW wrapper, which looks something like:

struct CoW_Wrapper(T)
{
  const(T) val;
  private bool isMutable;

  void opAssign(T t)
  {
    val = t;
    isMutable = true;
  }

  void opAssign(const(T) t)
  {
    val = t;
    isMutable = false;
  }

  T mutable()
  {
    if (!isMutable)
    {
      val = val.dup;
      isMutable = true;
    }
    return cast(T) val;
  }
}

This would make using CoW quite simple, and would also solve the problem of keeping track across function boundaries of whether it's been duped, for things like:

   string s = ...;
   string s2 = s.tolower().replace("foo", "bar").entab();

which could potentially involve 3 dups. (I'm aware of the easier solution of templating to support char[]->char[] and string->string overloads, but let's have some fun going the extra way :-) )

You would write something like tolower as (ignoring unicode stuff):

alias CoW_Wrapper!(char[]) cow_string;

cow_string tolower(cow_string s)
{
  foreach (i; 0..s.val.length)
  {
    if (s.val[i] >= 'A' && s.val[i] <= 'Z')
      s.mutable[i] = s.val[i] + 'a' - 'A';
  }
  return s;
}

The neat feature is that the cow_string returned knows whether it isMutable, so you will always have the fewest required dups: 0 or 1.

Unfortunately, it doesn't currently look very nice because of the lack of opImplicitCast, which means you'd have to write:

  string s2 = s.tolower.val;

instead of

  string s2 = s.tolower;

(there's also problems with the type of s, but that can be solved with some template stuff on the tolower side; those template manipulations can also bypass the cow_string struct if the parameter type is char[], by defining
   char[] mutable(char[] s) { return s; }
)

  -- Reiner
August 02, 2007

Regan Heath wrote:
> Instead perhaps we need a naming convention for inplace modifying functions, options:
> 
> void <func>Inplace(<args>)
> void <func>IP(<args>)       // "IP" stands for inplace
> void <func>M(<args>)        // "M" stands for mutates
> 
> or something like those.
> 
> Regan

I actually tried doing this in a math library of mine, but I didn't like using normal letters; I wanted to use a Unicode character that was easy to type, showed up in Vim, and would visually distinguish the functions.

Sadly, all I could come up with[1] for "in place" was '¶', and the compiler didn't like that :(

Sometimes, I wonder if it wouldn't help a lot to be able to define function overloads based on argument decoration.  So, you could have something like:

  tolower(str);
  tolower(inplace str);
  tolower(takecopy str);

There are so many times where you have several functions that do the exact same thing (semantically speaking), but have different implementations.

Maybe I could fake it with structs...

struct tolower
{
    string opCall(string str);
    void inplace(char[] str);
    char[] takecopy(char[] str);
    char[] takecopy(string str);
}

Something to think about, anyway...

	-- Daniel

[1] That's what you get if you type ^KIP in Vim.  If you type ^Kip, you get 'ぴ', and ^KiP is 'ピ' :P
August 02, 2007

Daniel Keep wrote:
> struct tolower
> {
>     string opCall(string str);
>     void inplace(char[] str);
>     char[] takecopy(char[] str);
>     char[] takecopy(string str);
> }

Those members should, of course, be static.  me.duh();
« First   ‹ Prev
1 2