View mode: basic / threaded / horizontal-split · Log in · Help
August 01, 2007
const and phobos string functions
I wanted to post this real/concrete case of where 'const' and the 
current phobos implementation of certain string routines combine to 
cause a few irritations, inefficiencies and some IMO illogical behaviour.

Note in the code below 'remove' is a template function which uses 
'memmove' to remove an item from the array.

(The code):

char[][] tmp;

tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
foreach(ref line; tmp) line = cast(char[])strip(line);
for(int i = 0; i < tmp.length; i++)
{
  if (tmp[i].length == 0)
    tmp.remove(i);
}

(The problem):

Notice all the cast()'s required above.  I am of the belief that casting 
should only be done in exceptional circumstances, and this shouldn't be 
one of them.

If you consider char[] as being a a string which you are the full owner 
of, you can change it at will, and 'string' as being a string which you 
are not the full owner of, as you can only read it then in many cases it 
doesn't make sense to pass a char[] to a function and get a 'string' 
back, unless the function has 'taken ownership' of the string data or is 
returning different string data which you are not the owner of.

The obvious case where it make sense is a property setter which might 
copy the input and return a 'string' reference to the copy.

In the cases above (splitlines, strip) it doesn't make sense for the 
function to 'take ownership' of the strings as they do not store them 
beyond the lifetime of the function call.

Therefore if passed a char[] they should return char[][] and char[] 
respectively.

Of course if passed 'string' they should return 'string[]' and 'string' 
respectively.

(The solution):

Given the requirements it seems templating is the obvious solution.  I 
have described in earlier posts (though these relate to tolower 
specifically):

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55335
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55337

All the phobos routines should likely be templated in the same fashion.

Regan
August 01, 2007
Re: const and phobos string functions
Regan Heath wrote:
> I wanted to post this real/concrete case of where 'const' and the 
> current phobos implementation of certain string routines combine to 
> cause a few irritations, inefficiencies and some IMO illogical behaviour.
> 
> Note in the code below 'remove' is a template function which uses 
> 'memmove' to remove an item from the array.
> 
> (The code):
> 
> char[][] tmp;
> 
> tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
> foreach(ref line; tmp) line = cast(char[])strip(line);
> for(int i = 0; i < tmp.length; i++)
> {
>   if (tmp[i].length == 0)
>     tmp.remove(i);
> }
> 
> (The problem):
> 
> Notice all the cast()'s required above.  I am of the belief that casting 
> should only be done in exceptional circumstances, and this shouldn't be 
> one of them.
> 
> If you consider char[] as being a a string which you are the full owner 
> of, you can change it at will, and 'string' as being a string which you 
> are not the full owner of, as you can only read it then in many cases it 
> doesn't make sense to pass a char[] to a function and get a 'string' 
> back, unless the function has 'taken ownership' of the string data or is 
> returning different string data which you are not the owner of.
> 
> The obvious case where it make sense is a property setter which might 
> copy the input and return a 'string' reference to the copy.
> 
> In the cases above (splitlines, strip) it doesn't make sense for the 
> function to 'take ownership' of the strings as they do not store them 
> beyond the lifetime of the function call.
> 
> Therefore if passed a char[] they should return char[][] and char[] 
> respectively.
> 
> Of course if passed 'string' they should return 'string[]' and 'string' 
> respectively.
> 
> (The solution):
> 
> Given the requirements it seems templating is the obvious solution.  I 
> have described in earlier posts (though these relate to tolower 
> specifically):
> 
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55335 
> 
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55337 
> 
> 
> All the phobos routines should likely be templated in the same fashion.
> 
> Regan

A lot of those casts go away if tmp is a string[].

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
August 02, 2007
Re: const and phobos string functions
Kirk McDonald wrote:
> Regan Heath wrote:
>> Note in the code below 'remove' is a template function which uses 
>> 'memmove' to remove an item from the array.
>>
>> (The code):
>>
>> char[][] tmp;
>>
>> tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
>> foreach(ref line; tmp) line = cast(char[])strip(line);
>> for(int i = 0; i < tmp.length; i++)
>> {
>>   if (tmp[i].length == 0)
>>     tmp.remove(i);
>> }
> 
> A lot of those casts go away if tmp is a string[].

True.  I knew someone would realise/spot that.  ;)  I don't think it 
solves the basic problem however, which is why I didn't mention it in me 
original post, perhaps I should have.  Let me explain further...

Imagine calling splitlines on char[] getting string[] then calling 
tolower on each line.

Because you have a string[], tolower always has to dup any line it wants 
to change.

If you template tolower and splitlines so that they accept const and 
non-const input and return the same then in the case where it gets 
non-const input tolower can avoid duplication (AKA *performance gain*)

The same is true for any function (phobos or otherwise) which will (or 
even might) modify the input data.

In the case of splitlines or strip they don't need to modify the input 
data so there is no performance gain by templating them and handling 
string and char[] differently.

However, what you do get is a function which no longer 'takes ownership' 
of the array passed and can then be used in a sequence of modifications 
without having to cast or dup to 'take ownership back' by brute force 
(the problem my code illustrates)

To think of it in the abstract sense, by passing a char[] you're passing 
ownership of the data to the function, when it returns it is passing 
ownership back again in the form char[] or char[][].

When you pass string you're not passing ownership and the function has 
no other option but to duplicate the input when it wants to modify it.

In the current case of tolower it returns string and therefore retains 
ownership of the string, despite the fact that it promptly forgets it 
ever existed.  Not much we can do about this case because we cannot 
decide what to return from a function at runtime.

Regan
August 02, 2007
Re: const and phobos string functions
Regan Heath wrote:
> Kirk McDonald wrote:
>> Regan Heath wrote:
>>> Note in the code below 'remove' is a template function which uses 
>>> 'memmove' to remove an item from the array.
>>>
>>> (The code):
>>>
>>> char[][] tmp;
>>>
>>> tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
>>> foreach(ref line; tmp) line = cast(char[])strip(line);
>>> for(int i = 0; i < tmp.length; i++)
>>> {
>>>   if (tmp[i].length == 0)
>>>     tmp.remove(i);
>>> }
>>
>> A lot of those casts go away if tmp is a string[].
> 
> True.  I knew someone would realise/spot that.  ;)  I don't think it 
> solves the basic problem however, which is why I didn't mention it in me 
> original post, perhaps I should have.  Let me explain further...
> 
> Imagine calling splitlines on char[] getting string[] then calling 
> tolower on each line.
> 
> Because you have a string[], tolower always has to dup any line it wants 
> to change.
> 
> If you template tolower and splitlines so that they accept const and 
> non-const input and return the same then in the case where it gets 
> non-const input tolower can avoid duplication (AKA *performance gain*)
> 
> The same is true for any function (phobos or otherwise) which will (or 
> even might) modify the input data.
> 
> In the case of splitlines or strip they don't need to modify the input 
> data so there is no performance gain by templating them and handling 
> string and char[] differently.
> 
> However, what you do get is a function which no longer 'takes ownership' 
> of the array passed and can then be used in a sequence of modifications 
> without having to cast or dup to 'take ownership back' by brute force 
> (the problem my code illustrates)
> 
> To think of it in the abstract sense, by passing a char[] you're passing 
> ownership of the data to the function, when it returns it is passing 
> ownership back again in the form char[] or char[][].
> 
> When you pass string you're not passing ownership and the function has 
> no other option but to duplicate the input when it wants to modify it.
> 
> In the current case of tolower it returns string and therefore retains 
> ownership of the string, despite the fact that it promptly forgets it 
> ever existed.  Not much we can do about this case because we cannot 
> decide what to return from a function at runtime.
> 
> Regan

In this particular case, you could call this mutating form of tolower() 
on the buffer returned from read(), and then split it afterwards 
(perhaps yielding strings). This makes a certain degree of logical 
sense: If you'd wanted to keep the original contents of the buffer, 
you'd have to duplicate it at some point anyway. Since you don't, you 
can alter it immediately.

I submit the following Python idiom: Functions which mutate their 
arguments should return nothing. That is:

// Return new string
string tolower(string);
// Mutate argument
void tolower(char[]);

This rather strictly highlights the difference between char[] and 
string, and makes it essentially impossible to mix up library functions 
differentiated in this way.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
August 02, 2007
Re: const and phobos string functions
Kirk McDonald wrote:
> In this particular case, you could call this mutating form of tolower() 
> on the buffer returned from read(), and then split it afterwards 
> (perhaps yielding strings). This makes a certain degree of logical 
> sense: If you'd wanted to keep the original contents of the buffer, 
> you'd have to duplicate it at some point anyway. Since you don't, you 
> can alter it immediately.

Sure, that solves this particular case.  But, do you agree there is 
generally a problem, or shall I continue to dream up example cases for 
you to solve in some other manner :)

> I submit the following Python idiom: Functions which mutate their 
> arguments should return nothing. That is:
> 
> // Return new string
> string tolower(string);
> // Mutate argument
> void tolower(char[]);
> 
> This rather strictly highlights the difference between char[] and 
> string, and makes it essentially impossible to mix up library functions 
> differentiated in this way.

That's all well and good until you want to write:

char[] s;
...
foo(tolower(s));

If you template tolower in the manner I described it's not possible to 
mix it up and call the wrong one anyway (as it selects the correct one 
based on the input) so it's a non-problem.

Regan
August 02, 2007
Re: const and phobos string functions
Regan Heath wrote:
> Kirk McDonald wrote:
>> In this particular case, you could call this mutating form of 
>> tolower() on the buffer returned from read(), and then split it 
>> afterwards (perhaps yielding strings). This makes a certain degree of 
>> logical sense: If you'd wanted to keep the original contents of the 
>> buffer, you'd have to duplicate it at some point anyway. Since you 
>> don't, you can alter it immediately.
> 
> Sure, that solves this particular case.  But, do you agree there is 
> generally a problem, or shall I continue to dream up example cases for 
> you to solve in some other manner :)
> 
>> I submit the following Python idiom: Functions which mutate their 
>> arguments should return nothing. That is:
>>
>> // Return new string
>> string tolower(string);
>> // Mutate argument
>> void tolower(char[]);
>>
>> This rather strictly highlights the difference between char[] and 
>> string, and makes it essentially impossible to mix up library 
>> functions differentiated in this way.
> 
> That's all well and good until you want to write:
> 
> char[] s;
> ...
> foo(tolower(s));
> 
> If you template tolower in the manner I described it's not possible to 
> mix it up and call the wrong one anyway (as it selects the correct one 
> based on the input) so it's a non-problem.
> 
> Regan

I think the code you gave here would be misleading. For instance, 
suppose you had your templated overloads, effectively giving you

   // copy-on-write
   string tolower(string);
   // in-place
   char[] tolower(char[]);

Then you would get surprising results if you did this:

char[] s = "Hello World!".dup;
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

I think avoiding this is exactly why Kirk suggested his overloads.

  -- Reiner
August 02, 2007
Re: const and phobos string functions
Reiner Pope wrote:
> Regan Heath wrote:
>> Kirk McDonald wrote:
>>> I submit the following Python idiom: Functions which mutate their 
>>> arguments should return nothing. That is:
>>>
>>> // Return new string
>>> string tolower(string);
>>> // Mutate argument
>>> void tolower(char[]);
>>>
>>> This rather strictly highlights the difference between char[] and 
>>> string, and makes it essentially impossible to mix up library 
>>> functions differentiated in this way.
>>
>> That's all well and good until you want to write:
>>
>> char[] s;
>> ...
>> foo(tolower(s));
>>
>> If you template tolower in the manner I described it's not possible to 
>> mix it up and call the wrong one anyway (as it selects the correct one 
>> based on the input) so it's a non-problem.
>>
>> Regan
> 
> I think the code you gave here would be misleading. For instance, 
> suppose you had your templated overloads, effectively giving you
> 
>    // copy-on-write
>    string tolower(string);
>    // in-place
>    char[] tolower(char[]);
> 
> Then you would get surprising results if you did this:
> 
> char[] s = "Hello World!".dup;
> int a = howManyLettersDiffer(tolower(s), s);
> assert (a == 2); // assert failed, a is actually 0
> 
> I think avoiding this is exactly why Kirk suggested his overloads.

Good point.

Using Kirk's Python'esque function signatures you'd have:

char[] s = "Hello World!".dup;
tolower(s);
int a = howManyLettersDiffer(s, s);
assert (a == 2); // assert failed, a is actually 0

which is immediately and obviously a pointless operation, requring a 
re-code to:

char[] s = "Hello World!".dup;
char[] s2 = s.dup
tolower(s2);
int a = howManyLettersDiffer(s2, s);
assert (a == 2); // assert failed, a is actually 0

And the 'string' case:

string s = "Hello World!";
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

which would already behave as desired.

It seems that tolower is always going to have to 'copy on write' and 
return that copy.

But, that doesn't stop you templating it so that it returns the copy as 
char[] when you pass char[] (which is all I really wanted in the first 
place) :)

Unfortunately as we cannot overload on return type it would prevent an 
inplace tolower in the form (given by Kirk):

void tolower(char[])

Instead perhaps we need a naming convention for inplace modifying 
functions, options:

void <func>Inplace(<args>)
void <func>IP(<args>)       // "IP" stands for inplace
void <func>M(<args>)        // "M" stands for mutates

or something like those.

Regan
August 02, 2007
Re: const and phobos string functions
Regan Heath wrote:
> Reiner Pope wrote:
>> Regan Heath wrote:
>>> Kirk McDonald wrote:
>>>> I submit the following Python idiom: Functions which mutate their 
>>>> arguments should return nothing. That is:
>>>>
>>>> // Return new string
>>>> string tolower(string);
>>>> // Mutate argument
>>>> void tolower(char[]);
>>>>
>>>> This rather strictly highlights the difference between char[] and 
>>>> string, and makes it essentially impossible to mix up library 
>>>> functions differentiated in this way.
>>>
>>> That's all well and good until you want to write:
>>>
>>> char[] s;
>>> ...
>>> foo(tolower(s));
>>>
>>> If you template tolower in the manner I described it's not possible 
>>> to mix it up and call the wrong one anyway (as it selects the correct 
>>> one based on the input) so it's a non-problem.
>>>
>>> Regan
>>
>> I think the code you gave here would be misleading. For instance, 
>> suppose you had your templated overloads, effectively giving you
>>
>>    // copy-on-write
>>    string tolower(string);
>>    // in-place
>>    char[] tolower(char[]);
>>
>> Then you would get surprising results if you did this:
>>
>> char[] s = "Hello World!".dup;
>> int a = howManyLettersDiffer(tolower(s), s);
>> assert (a == 2); // assert failed, a is actually 0
>>
>> I think avoiding this is exactly why Kirk suggested his overloads.
> 
> Good point.
> 
> Using Kirk's Python'esque function signatures you'd have:
> 
> char[] s = "Hello World!".dup;
> tolower(s);
> int a = howManyLettersDiffer(s, s);
> assert (a == 2); // assert failed, a is actually 0
> 
> which is immediately and obviously a pointless operation, requring a 
> re-code to:
> 
> char[] s = "Hello World!".dup;
> char[] s2 = s.dup
> tolower(s2);
> int a = howManyLettersDiffer(s2, s);
> assert (a == 2); // assert failed, a is actually 0
> 
> And the 'string' case:
> 
> string s = "Hello World!";
> int a = howManyLettersDiffer(tolower(s), s);
> assert (a == 2); // assert failed, a is actually 0
> 
> which would already behave as desired.
> 
> It seems that tolower is always going to have to 'copy on write' and 
> return that copy.
> 
> But, that doesn't stop you templating it so that it returns the copy as 
> char[] when you pass char[] (which is all I really wanted in the first 
> place) :)
> 
> Unfortunately as we cannot overload on return type it would prevent an 
> inplace tolower in the form (given by Kirk):
> 
> void tolower(char[])
> 
> Instead perhaps we need a naming convention for inplace modifying 
> functions, options:
> 
> void <func>Inplace(<args>)
> void <func>IP(<args>)       // "IP" stands for inplace
> void <func>M(<args>)        // "M" stands for mutates
> 
> or something like those.
> 
> Regan

A lot of functions just to do the string operation efficiently, though, 
isn't it? It's easy to see in cases like this why Walter cites the 
Pareto principle, saying mostly this isn't the bottleneck.

But still, isn't D supposed to give you fast code _easily_ ?

I must say, I *like* the simplicity of overloading with a mutable and a 
readonly version, although we have established that it can certainly 
lead to some confusion. Mind you, D arrays ignore that potential 
confusion (in the .sort and .reverse properties).

I also had some fancy ideas for a CoW wrapper, which looks something like:

struct CoW_Wrapper(T)
{
  const(T) val;
  private bool isMutable;

  void opAssign(T t)
  {
    val = t;
    isMutable = true;
  }

  void opAssign(const(T) t)
  {
    val = t;
    isMutable = false;
  }

  T mutable()
  {
    if (!isMutable)
    {
      val = val.dup;
      isMutable = true;
    }
    return cast(T) val;
  }
}

This would make using CoW quite simple, and would also solve the problem 
of keeping track across function boundaries of whether it's been duped, 
for things like:

   string s = ...;
   string s2 = s.tolower().replace("foo", "bar").entab();

which could potentially involve 3 dups. (I'm aware of the easier 
solution of templating to support char[]->char[] and string->string 
overloads, but let's have some fun going the extra way :-) )

You would write something like tolower as (ignoring unicode stuff):

alias CoW_Wrapper!(char[]) cow_string;

cow_string tolower(cow_string s)
{
  foreach (i; 0..s.val.length)
  {
    if (s.val[i] >= 'A' && s.val[i] <= 'Z')
      s.mutable[i] = s.val[i] + 'a' - 'A';
  }
  return s;
}

The neat feature is that the cow_string returned knows whether it 
isMutable, so you will always have the fewest required dups: 0 or 1.

Unfortunately, it doesn't currently look very nice because of the lack 
of opImplicitCast, which means you'd have to write:

  string s2 = s.tolower.val;

instead of

  string s2 = s.tolower;

(there's also problems with the type of s, but that can be solved with 
some template stuff on the tolower side; those template manipulations 
can also bypass the cow_string struct if the parameter type is char[], 
by defining
   char[] mutable(char[] s) { return s; }
)

  -- Reiner
August 02, 2007
Re: const and phobos string functions
Regan Heath wrote:
> Instead perhaps we need a naming convention for inplace modifying
> functions, options:
> 
> void <func>Inplace(<args>)
> void <func>IP(<args>)       // "IP" stands for inplace
> void <func>M(<args>)        // "M" stands for mutates
> 
> or something like those.
> 
> Regan

I actually tried doing this in a math library of mine, but I didn't like
using normal letters; I wanted to use a Unicode character that was easy
to type, showed up in Vim, and would visually distinguish the functions.

Sadly, all I could come up with[1] for "in place" was '¶', and the
compiler didn't like that :(

Sometimes, I wonder if it wouldn't help a lot to be able to define
function overloads based on argument decoration.  So, you could have
something like:

 tolower(str);
 tolower(inplace str);
 tolower(takecopy str);

There are so many times where you have several functions that do the
exact same thing (semantically speaking), but have different
implementations.

Maybe I could fake it with structs...

struct tolower
{
   string opCall(string str);
   void inplace(char[] str);
   char[] takecopy(char[] str);
   char[] takecopy(string str);
}

Something to think about, anyway...

	-- Daniel

[1] That's what you get if you type ^KIP in Vim.  If you type ^Kip, you
get 'ぴ', and ^KiP is 'ピ' :P
August 02, 2007
Re: const and phobos string functions
Daniel Keep wrote:
> struct tolower
> {
>     string opCall(string str);
>     void inplace(char[] str);
>     char[] takecopy(char[] str);
>     char[] takecopy(string str);
> }

Those members should, of course, be static.  me.duh();
« First   ‹ Prev
1 2
Top | Discussion index | About this forum | D home