July 05, 2007
Regan Heath wrote:
> Walter Bright Wrote:
>> Derek Parnell wrote:
>>> However, if I might need to update it ...
>>>
>>>    char[] fullpath;
>>>
>>>    fullpath = CanonicalPath(shortname).dup;
>>>    version(Windows)
>>>    {
>>>       setLowerCase(fullpath);
>>>    }
>>>
>>> The point is that the 'CanonicalPath' function hasn't got a clue what the
>>> calling function is intending to do with the result so it is trying to be
>>> responsible by guarding it against mistakes by the caller.
>> If you write it like this:
>>
>> string fullpath;
>>
>> fullpath = CanonicalPath(shortname);
>> version(Windows)
>> {
>>        fullpath = std.string.tolower(fullpath);
>> }
>>
>> you won't need to do the .dup .
> 
> Because tolower does it for you, but it still returns string and if for example you need to add something to the end of the path, like a filename you will end up doing yet another dup somewhere.
> 
> I think the solution may be to template all functions which return the input string, or part of the input string, eg.
> 
> T tolower(T)(T input)
> {
> }
> 
> That way if you call it with char[] you get a char[] back, if you call it with string you get a string back.
> 

It doesn't make sense to template it, because you'd still have two different function versions, that would work differently. The one that receives a string does a dup, the one that receives a char[] does not dup. The return type of tolower(string str) might also be char[] and not string, if tolower(string str) would allways does a dup, even if no character modifications are necessary.


-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
July 05, 2007
On Thu, 05 Jul 2007 01:18:28 +0300, Derek Parnell <derek@psych.ward> wrote:
> I'm converting Bud to compile using V2 and so far its been a very hard
> thing to do. I'm finding that I'm now having to use '.dup' and '.idup' all
> over the place, which is exactly what I thought would happen. Bud does a
> lot of text manipulation so having 'string' as invariant means that calls
> to functions that return string need to often be .dup'ed because I need to
> assign the result to a malleable variable.
>
> I might have to rethink of the design of the application to avoid the
> performance hit of all these dups.
>

That got me thinking about string functions in general.

First, I am wondering why some functions are formed as follows:
(but I'm sure someone will (hopefully) enlight me about that ;) )

  string foo(string bar);

That is, if they return something else than 'bar' (they do some string manipulation).
Shouldn't they return char[] instead? For example:

  char[] foo(string bar) {
    return bar ~ "blah";
  }


And this brings us to the 'tolower()' function (for instance).

Sometimes it .dups and sometimes it doesn't. So, if I don't know if the input string
contains upper cased chars, I have to .dup the return value, even if it may already
been .dupped by 'tolower()'...

  char[] a = "abc".dup;
  char[] b = tolower(a).dub;  //.dupped once ('tolower()' returns plain 'a')

  char[] a = "ABC".dup;
  char[] b = tolower(a).dub;  //.dupped twice!

So 'tolower()' is a hybrid of two function groups:
(1) functions that modify the input string,
(2) functions that returns a (modified) copy of the input string.

(If the input string doesn't contains upper cased chars it behaves like (1)
(even if it doesn't actually modify the input string), otherwise it behaves like (2).)

I don't think this is a good thing.
There should be two different functions, one for each group:

  char[] tolower(char[] str);  //modifies and returns 'str'

  char[] getlower(string str);  //returns a copy


If one likes the copy-on-write behaviour of 'tolower(), I think it would
work only by using reference counting.

For example (the 'String' class uses reference counting):

  String a, b;

  a = "abc";
  b = tolower(a);  //'b' points to 'a' ('tolower()' simply returns 'a')

  b[0] = 'x';  //'b' .dups its contents before modification, so 'a' is not changed
July 05, 2007
Derek Parnell wrote:
> On Thu, 05 Jul 2007 00:15:41 -0700, Sean Kelly wrote:
> 
>> Derek Parnell wrote:
>>> I'm converting Bud to compile using V2 and so far its been a very hard
>>> thing to do. I'm finding that I'm now having to use '.dup' and '.idup' all
>>> over the place, which is exactly what I thought would happen. Bud does a
>>> lot of text manipulation so having 'string' as invariant means that calls
>>> to functions that return string need to often be .dup'ed because I need to
>>> assign the result to a malleable variable. 
>> So just use char[] instead of 'string'.  I don't plan to use the aliases much either.
> 
> It's not so clear cut. Firstly, a lot of phobos routines now return
> 'string' results and expect 'string' inputs.

I'd argue that the parameters should be "const char[]" rather than "string", and it's hard to say for the return values.

> Secondly, I like the idea of
> general purpose functions returning 'const' data, because it helps guard
> against inadvertent modifications by the calling routines. It is up to the
> calling function to explicitly decide if it is going to modify returned
> stuff or not.
> 
> For example, if I know that I'll not need to modify the 'fullpath' then I
> might do this ...
> 
>    string fullpath;
> 
>    fullpath = CanonicalPath(shortname);

I would say that whether the return value is const/invariant indicates ownership.  If the called function/class owns the data then it is const or invariant.  If it does not then it is not const/invariant.  This seems to largely limit "string" as a return value to property methods.

> However, if I might need to update it ...
> 
>    char[] fullpath;
> 
>    fullpath = CanonicalPath(shortname).dup;
>    version(Windows)
>    {
>       setLowerCase(fullpath);
>    }
> 
> The point is that the 'CanonicalPath' function hasn't got a clue what the
> calling function is intending to do with the result so it is trying to be
> responsible by guarding it against mistakes by the caller.

Right.  See above.


Sean
July 05, 2007
Regan Heath wrote:
> Walter Bright Wrote:
>> string fullpath;
>>
>> fullpath = CanonicalPath(shortname);
>> version(Windows)
>> {
>>        fullpath = std.string.tolower(fullpath);
>> }
>>
>> you won't need to do the .dup .
> 
> Because tolower does it for you, but it still returns string

tolower only dups the string if it needs to. It won't dup a string that is already in lower case.

> and if for example
> you need to add something to the end of the path, like a filename you > will end up
> doing yet another dup somewhere.

Concatenating strings does not require a .dup.
July 05, 2007
Derek Parnell wrote:
> I'll make my code example more free from assumed functionality.
> 
> 
>  char[] qwerty;
>   qwerty = KJHGF(poiuy).dup;
>  version(xyzzy)
>  {
>      MNBVC(qwerty);
>  }
> 
> As you can see, my point is made without regard to converting stuff to
> lower case.

My point is that the way the snippet is written is inside out. Do not use .dup to preemptively make a copy in case it gets changed somewhere later one. The style is to make a .dup *only if* the contents will be changed and do the .dup *at the site* of the modification.

In other words, dups should be done from the bottom up, not from the top down.

I think such a style helps fit things together nicely and avoids strange .dups appearing in inexplicable places.
July 05, 2007
Regan Heath wrote:
> Aaargh!  You're confusing empty and non-existant (null) again!  <g>

In this case, no.

> In some cases there is an important difference between the two.

The only case is when you're extending into a preallocated buffer. Such cannot be the case with string literals.
July 05, 2007
Derek Parnell wrote:
>> This should do it nicely:
>>
>> 	text = null;
> 
> Not really. I want an empty text and not a non-text.

Such a distinction is critical in C code, but is not of much use in D code. What do you need the distinction for?

> Also, it doesn't fit
> right with other data types - the consistency thing again.
> 
>    text = typeof(text).init; 
> 
> works better for me because I can also use this construct in templates
> without problems.

The .init for char[] is null, not "".

> But really, this thread can die now. I didn't mean to go off into weird
> tangental subects.

I think you've raised a couple of very important stylistic issues, and it is worth pursuing.
July 05, 2007
Kristian Kilpi wrote:
> First, I am wondering why some functions are formed as follows:
> (but I'm sure someone will (hopefully) enlight me about that ;) )
> 
>   string foo(string bar);
> 
> That is, if they return something else than 'bar' (they do some string manipulation).
> Shouldn't they return char[] instead?

No, because then they must always dup the string. If they don't need to dup the string, they can return a reference to the parameter, and if so, it must be const.

> There should be two different functions, one for each group:
> 
>   char[] tolower(char[] str);  //modifies and returns 'str'
> 
>   char[] getlower(string str);  //returns a copy

When one would use a mutating tolower, one is already manipulating the contents of a string character by character. In such cases, one can tolower the characters in that process, instead of doing it later (the former will be more efficient anyway, and the only advantage to a mutating tolower is an efficiency improvement).

Using the functional-style copy-on-write string functions will result in easy to understand, less buggy programs. Doing strings in this manner is a proven success in just about every programming language.
July 05, 2007
On Thu, 05 Jul 2007 11:51:30 -0700, Walter Bright wrote:

> Derek Parnell wrote:
>> I'll make my code example more free from assumed functionality.
>> 
>>  char[] qwerty;
>> 
>>  qwerty = KJHGF(poiuy).dup;
>>  version(xyzzy)
>>  {
>>      MNBVC(qwerty);
>>  }
>> 
>> As you can see, my point is made without regard to converting stuff to lower case.
> 
> My point is that the way the snippet is written is inside out. Do not use .dup to preemptively make a copy in case it gets changed somewhere later one. The style is to make a .dup *only if* the contents will be changed and do the .dup *at the site* of the modification.
> 
> In other words, dups should be done from the bottom up, not from the top down.
> 
> I think such a style helps fit things together nicely and avoids strange .dups appearing in inexplicable places.

Thanks. This is what I meant by taking rethinking the design of my routines. I'll strongly consider your suggestion even though it does complicate the algorirhm for readers of the code.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell
July 05, 2007
Reply to Walter,

> Derek Parnell wrote:
> 
>> I'll make my code example more free from assumed functionality.
>> 
>> char[] qwerty;
>> 
>> qwerty = KJHGF(poiuy).dup;
>> version(xyzzy)
>> {
>> MNBVC(qwerty);
>> }
>> As you can see, my point is made without regard to converting stuff
>> to lower case.
>> 
> My point is that the way the snippet is written is inside out. Do not
> use .dup to preemptively make a copy in case it gets changed somewhere
> later one. The style is to make a .dup *only if* the contents will be
> changed and do the .dup *at the site* of the modification.
> 
> In other words, dups should be done from the bottom up, not from the
> top down.
> 
> I think such a style helps fit things together nicely and avoids
> strange .dups appearing in inexplicable places.
> 


The one issue I can see with this is where an input is const but may be changed (and .duped) at any of a number of points. The data though only needs to be .duped once.

|char[] Whatever(const char[] str)
|{
| if(c1) str = Mod1(str.dup);
| if(c2) str = Mod2(str.dup);
| if(c3) str = Mod3(str.dup);
| return str;
|}
// causes exces duping


I can't think of a better solution than this (and this is BAD):

|char[] Whatever(const char[] str)
|{
| sw: switch(-1)
| {
|  foreach(bool b; T!(true, false))
|  {
|   if(c1) {static if(b){str = str.dup; goto case 1;} else {case 1:  str = Mod1(str.dup);}}
|   if(c2) {static if(b){str = str.dup; goto case 2;} else {case 2:  str = Mod2(str.dup);}}
|   if(c3) {static if(b){str = str.dup; goto case 3;} else {case 3:  str = Mod3(str.dup);}}
|   return str;
|  }
| }
|}