V2 string (page 2) - D Programming Language Discussion Forum

July 05, 2007

Posted by Regan Heath
in reply to Derek Parnell

Derek Parnell Wrote:
> On Wed, 04 Jul 2007 15:48:45 -0700, Walter Bright wrote:
> 
> > Derek Parnell wrote:
> >> I'm converting Bud to compile using V2 and so far it's been a very hard thing to do. I'm finding that I'm now having to use '.dup' and '.idup' all over the place, which is exactly what I thought would happen. Bud does a lot of text manipulation, so having 'string' as invariant means that the results of functions that return string often need to be .dup'ed because I need to assign the result to a malleable variable.
> >> 
> >> I might have to rethink the design of the application to avoid the performance hit of all these dups.
> >> 
> > 
> > First of all, if you were returning string literals as char[] and trying to manipulate them, they'd fail on linux at run time (because string literals are put into read only segments).
> 
> But I'm not, and never have been, returning string literals anywhere.
> 
> > Second, you can use char[] instead of string.
> 
> The idiom I'm using is that functions that receive text have those parameters as 'string' to guard against the function inadvertently modifying that which is passed

Yep, makes sense.

> , and functions that return text return
> 'string' to guard against calling functions inadvertently modifying data
> that they did not create (own).

Question: do these functions keep a copy of the returned string? Or, to rephrase, after returning the string do they still 'own' it, or have they washed their hands of it? Are they, in a sense, passing ownership to the calling function?

If they no longer 'own' the string then they can return it as a char[] instead of string and all your problems are solved, right?

I imagine that if they return a slice of the input string, and that input was 'string' rather than char[], then they would also return string (because doing otherwise would be claiming ownership of the input string and giving it away to the caller, which may not be valid).

Maybe you have a lot of functions returning slices to the input string?

Maybe you need to template them? i.e.

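T func(T)(T param)   // placeholder name; 'function' itself is a keyword in D
{
    return param;    // or any slice of param, e.g. param[0 .. $]
}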

so if you pass string you get string, if you pass char[] you get char[].

Maybe all string routines which return slices of the input should be so templated?
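
A minimal sketch of that templated idea, assuming a current D compiler (where 'string' is immutable(char)[] rather than the invariant(char)[] of 2007). The helper name firstWord is hypothetical and not part of Phobos; it only illustrates that a slice-returning template hands back the same type it was given.

```d
import std.stdio : writeln;

// Hypothetical helper, not from Phobos: returns the slice of `text`
// up to (not including) the first space, or all of `text` if there is none.
T firstWord(T)(T text)
{
    foreach (i, c; text)
    {
        if (c == ' ')
            return text[0 .. i];
    }
    return text;
}

void main()
{
    string fixedText = "hello world";
    char[] editable = "hello world".dup;

    auto a = firstWord(fixedText);   // T is deduced as string
    auto b = firstWord(editable);    // T is deduced as char[]

    static assert(is(typeof(a) == string));
    static assert(is(typeof(b) == char[]));

    b[0] = 'H';          // allowed: b is a mutable slice of `editable`
    writeln(a);          // hello
    writeln(editable);   // Hello world
}
```

The caller who passes char[] keeps the right to modify the result without a .dup, while the caller who passes string still gets the compile-time guarantee.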

Regan

July 05, 2007

Posted by Regan Heath
in reply to Walter Bright

Walter Bright Wrote:
> Derek Parnell wrote:
> > However, if I might need to update it ...
> > 
> >    char[] fullpath;
> > 
> >    fullpath = CanonicalPath(shortname).dup;
> >    version(Windows)
> >    {
> >       setLowerCase(fullpath);
> >    }
> > 
> > The point is that the 'CanonicalPath' function hasn't got a clue what the calling function is intending to do with the result so it is trying to be responsible by guarding it against mistakes by the caller.
> 
> If you write it like this:
> 
> string fullpath;
> 
> fullpath = CanonicalPath(shortname);
> version(Windows)
> {
>        fullpath = std.string.tolower(fullpath);
> }
> 
> you won't need to do the .dup .

Because tolower does it for you, but it still returns string and if for example you need to add something to the end of the path, like a filename you will end up doing yet another dup somewhere.

I think the solution may be to template all functions which return the input string, or part of the input string, eg.

T tolower(T)(T input)
{
}

That way if you call it with char[] you get a char[] back, if you call it with string you get a string back.

However...

tolower is an interesting case.  As a caller I expect it to modify the string, or perhaps give a modified copy back (both options are valid and should perhaps be supported?).

So, the 'string tolower(string)' version has 2 cases, the first case where it doesn't need to modify the input and can simply return it, no problem.

But case 2, where it does modify it, should dup and return char[].  My reasoning being that after it has completed and returned the copy, the caller now 'owns' the string (as it's the only copy in existence and no-one else has a reference to it).

To achieve that we'd need to overload on return type, or something clever...  but then, how do we call it?

auto s = tolower(input);

tolower cannot be selected at compile time, and the type of s cannot be known either, so that's an impossible situation, yes?

Regan

July 05, 2007

Posted by Bruno Medeiros
in reply to Derek Parnell

Derek Parnell wrote:
> On Wed, 04 Jul 2007 15:48:45 -0700, Walter Bright wrote:
> 
> The idiom I'm using is that functions that receive text have those
> parameters as 'string' to guard against the function inadvertently
> modifying that which is passed, and functions that return text return
> 'string' to guard against calling functions inadvertently modifying data
> that they did not create (own).
> 
> This leads to constructs like ...
> 
>    char[] result;
> 
>    result = SomeTextFunc(data).dup;
> 
> Another commonly used idiom that I had to stop using was ...
> 
>    char[] text;
>    text = getvalue();
>    if (wrongvalue(text))
>        text = ""; // Reset to an empty string
> 
> I now code ...
> 
>        text.length = 0; // Reset to an empty string
> 
> which is slightly less readable.
> 


Why is 'text.length = 0;' or 'text = text.init;' better than the idiom
  str = "".dup;
which also works for any kind of string, not just empty strings?

I found however, that there is a bug with that code:
http://d.puremagic.com/issues/show_bug.cgi?id=1314

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

July 05, 2007

Posted by Bruno Medeiros
in reply to Regan Heath

Regan Heath wrote:
> Walter Bright Wrote:
>> Derek Parnell wrote:
>>> However, if I might need to update it ...
>>>
>>>    char[] fullpath;
>>>
>>>    fullpath = CanonicalPath(shortname).dup;
>>>    version(Windows)
>>>    {
>>>       setLowerCase(fullpath);
>>>    }
>>>
>>> The point is that the 'CanonicalPath' function hasn't got a clue what the
>>> calling function is intending to do with the result so it is trying to be
>>> responsible by guarding it against mistakes by the caller.
>> If you write it like this:
>>
>> string fullpath;
>>
>> fullpath = CanonicalPath(shortname);
>> version(Windows)
>> {
>>        fullpath = std.string.tolower(fullpath);
>> }
>>
>> you won't need to do the .dup .
> 
> Because tolower does it for you, but it still returns string and if for example you need to add something to the end of the path, like a filename you will end up doing yet another dup somewhere.
> 
> I think the solution may be to template all functions which return the input string, or part of the input string, eg.
> 
> T tolower(T)(T input)
> {
> }
> 
> That way if you call it with char[] you get a char[] back, if you call it with string you get a string back.
> 
> However...
> 
> tolower is an interesting case.  As a caller I expect it to modify the string, or perhaps give a modified copy back (both options are valid and should perhaps be supported?).
> 
> So, the 'string tolower(string)' version has 2 cases, the first case where it doesn't need to modify the input and can simply return it, no problem.  
> 
> But case 2, where it does modify it, should dup and return char[].  My reasoning being that after it has completed and returned the copy, the caller now 'owns' the string (as it's the only copy in existence and no-one else has a reference to it).
> 

Indeed, I think this illustrates that some standard library functions may not have the correct signature, and tolower is likely one of them.
The most general case for tolower is:
  char[] tolower(const(char)[] s);
Since tolower creates a new array, but does not keep it, it can give away its ownership of the array (i.e., return a mutable array).

The second case, more specific, is simply syntactic sugar for making that array invariant:

  invariant(char)[] tolowerinv(const(char)[] str) {
    return cast(invariant) tolower(str);
  }

The current signature:
  const(char)[] tolower(const(char)[] str)
is kinda incorrect, because it returns a const reference for an array that has no mutable references, and that is the same as an invariant reference, so tolower might as well return invariant(char)[].


> To achieve that we'd need to overload on return type, or something clever...  but then, how do we call it?
> 
> auto s = tolower(input);
> 
> tolower cannot be selected at compile time, and the type of s cannot be known either, so that's an impossible situation, yes?
> 
> Regan

The 'something clever' to distinguish both cases is simply naming two different functions, like tolower or tolowerinv (if the second function is needed at all).


-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

July 05, 2007

Posted by Derek Parnell
in reply to Walter Bright

On Thu, 05 Jul 2007 00:42:25 -0700, Walter Bright wrote:

> Derek Parnell wrote:
>> The idiom I'm using is that functions that receive text have those parameters as 'string' to guard against the function inadvertently modifying that which is passed, and functions that return text return 'string' to guard against calling functions inadvertently modifying data that they did not create (own).
>> 
>> This leads to constructs like ...
>> 
>>    char[] result;
>> 
>>    result = SomeTextFunc(data).dup;
> 
> If you're needing to guard against inadvertent modification, that's just what const strings are for. I'm not understanding the issue here.

There is no issue. I'm not raising an issue. I'm just making some observations about my experience so far in moving to V2.

I'm not surprised by the effort it is taking. I expected it. Why? Because I knew that most of the strings I work with are text (mutable things), and using the D 'string', an immutable thing, for function signatures was going to mean I'd have to change things to suit.

I chose to use 'string' to safeguard myself from making stupid coding errors. And it's working. My next pass through the application code will be to find places where I can safely return a 'text' thing instead of a 'string' thing, which is a performance tuning exercise.

>> Another commonly used idiom that I had to stop using was ...
>> 
>>    char[] text;
>>    text = getvalue();
>>    if (wrongvalue(text))
>>        text = ""; // Reset to an empty string
>> 
>> I now code ...
>> 
>>        text.length = 0; // Reset to an empty string
>> 
>> which is slightly less readable.
> 
> This should do it nicely:
> 
> 	text = null;

Not really. I want an empty text and not a non-text. Also, it doesn't fit right with other data types - the consistency thing again.

   text = typeof(text).init;

works better for me because I can also use this construct in templates without problems.
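
For reference, a small sketch comparing the three reset idioms mentioned in this exchange, assuming a current D compiler. One detail worth noting: the .init value of a dynamic array is the null array, so 'text = null' and 'text = typeof(text).init' leave the variable in the same state. The difference is that the .init form also works in generic code where the type is not an array.

```d
void main()
{
    char[] text = "some value".dup;

    // 1. Shrink the existing array to zero length.
    text.length = 0;
    assert(text.length == 0);

    // 2. Assign null: the array now has no elements and no storage.
    text = "some value".dup;
    text = null;
    assert(text is null && text.length == 0);

    // 3. Assign the type's default initializer. For a dynamic array
    //    this is the null array, i.e. the same state as case 2.
    text = "some value".dup;
    text = typeof(text).init;
    assert(text is null);

    // Array equality compares contents, so a null array equals "".
    assert(text == "");
}
```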

But really, this thread can die now. I didn't mean to go off into weird tangential subjects.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

July 05, 2007

Posted by Derek Parnell
in reply to Walter Bright

On Thu, 05 Jul 2007 01:06:45 -0700, Walter Bright wrote:

> Derek Parnell wrote:
>> However, if I might need to update it ...
>> 
>>    char[] fullpath;
>> 
>>    fullpath = CanonicalPath(shortname).dup;
>>    version(Windows)
>>    {
>>       setLowerCase(fullpath);
>>    }
>> 
>> The point is that the 'CanonicalPath' function hasn't got a clue what the calling function is intending to do with the result so it is trying to be responsible by guarding it against mistakes by the caller.
> 
> If you write it like this:
> 
> string fullpath;
> 
> fullpath = CanonicalPath(shortname);
> version(Windows)
> {
>        fullpath = std.string.tolower(fullpath);
> }
> 
> you won't need to do the .dup .

If you have any failing, Walter, it's your ability to focus on insignificant minutiae as a form of distraction from the point that people are really trying to make.

I was not talking about how to do efficient lower case conversion.

I'll make my code example more free from assumed functionality.

 char[] qwerty;

 qwerty = KJHGF(poiuy).dup;
 version(xyzzy)
 {
     MNBVC(qwerty);
 }

As you can see, my point is made without regard to converting stuff to lower case.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

July 05, 2007

Posted by Frits van Bommel
in reply to Bruno Medeiros

Bruno Medeiros wrote:
> Regan Heath wrote:
>> tolower is an interesting case.  As a caller I expect it to modify the string, or perhaps give a modified copy back (both options are valid and should perhaps be supported?).
>>
>> So, the 'string tolower(string)' version has 2 cases, the first case where it doesn't need to modify the input and can simply return it, no problem. But case 2, where it does modify it, should dup and return char[].  My reasoning being that after it has completed and returned the copy, the caller now 'owns' the string (as it's the only copy in existence and no-one else has a reference to it).
>>
> 
> Indeed, I think this illustrates that some standard library functions may not have the correct signature, and tolower is likely one of them.
> The most general case for tolower is:
>   char[] tolower(const(char)[] s);
> Since tolower creates a new array, but does not keep it, it can give away its ownership of the array (i.e., return a mutable array).

Sorry, but you seem to have missed a bit above: if the string doesn't contain any uppercase characters tolower returns the input without .dup-ing it (aka copy-on-write).

> The second case, more specific, is simply syntactic sugar for making that array invariant:
> 
>   invariant(char)[] tolowerinv(const(char)[] str) {
>     return cast(invariant) tolower(str);
>   }

Yes, but only if it actually needs to modify the string.

You seem to have missed that the two cases can't (in general) be distinguished at compile time; it's only at run time when a choice is made between a copy and no copy.

> The current signature:
>   const(char)[] tolower(const(char)[] str)
> is kinda incorrect, because it returns a const reference for an array that has no mutable references, and that is the same as an invariant reference, so tolower might as well return invariant(char)[].

Again, that only holds if a copy was actually made at run time. If no copy was made the original input is returned, to which there may be mutable references.
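
The copy-on-write behaviour described here can be sketched as follows. This is an illustration under assumptions, not the Phobos implementation: lowerCow is a hypothetical name, it handles ASCII only, and it is written for a current D compiler. It shows why the result cannot safely be treated as invariant: when no copy is made, the returned slice is the caller's own memory.

```d
// Hypothetical copy-on-write lowercase for ASCII text (not Phobos code).
const(char)[] lowerCow(const(char)[] s)
{
    foreach (i, c; s)
    {
        if (c >= 'A' && c <= 'Z')
        {
            char[] copy = s.dup;            // first uppercase found: copy once
            foreach (ref ch; copy[i .. $])
            {
                if (ch >= 'A' && ch <= 'Z')
                    ch = cast(char)(ch + ('a' - 'A'));
            }
            return copy;
        }
    }
    return s;                               // nothing to change: same memory
}

void main()
{
    // No uppercase characters: the input itself comes back.
    char[] buf = "already lower".dup;
    auto same = lowerCow(buf);
    assert(same.ptr is buf.ptr);

    // A mutable reference to that memory still exists, so the
    // "const" result changes underneath the caller.
    buf[0] = 'X';
    assert(same[0] == 'X');

    // Uppercase present: a fresh copy is returned.
    auto copied = lowerCow("Mixed Case");
    assert(copied == "mixed case");
}
```

Which branch runs depends on the contents of the input, which is why the two cases cannot be told apart by the type system.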

July 05, 2007

Posted by Regan Heath
in reply to Bruno Medeiros

Bruno Medeiros wrote:
> The 'something clever' to distinguish both cases is simply naming two different functions, like tolower or tolowerinv (if the second function is needed at all).

I was hoping for something clever'er ;)

Regan

July 05, 2007

Posted by Oskar Linde
in reply to Derek Parnell

Derek Parnell wrote:

> I'll make my code example more free from assumed functionality.
> 
> 
>  char[] qwerty;
>   qwerty = KJHGF(poiuy).dup;
>  version(xyzzy)
>  {
>      MNBVC(qwerty);
>  }
> 
> As you can see, my point is made without regard to converting stuff to
> lower case.

What you are doing there is mixing two styles of functions. Functional (KJHGF) and in-place modifying functions (MNBVC). Walter's modification was making both use a common style (functional).

Mixing those two function styles will naturally require different types of constness.

-- 
Oskar

July 05, 2007

Posted by Bruno Medeiros
in reply to Frits van Bommel

Frits van Bommel wrote:
> Bruno Medeiros wrote:
>> Regan Heath wrote:
>>> tolower is an interesting case.  As a caller I expect it to modify the string, or perhaps give a modified copy back (both options are valid and should perhaps be supported?).
>>>
>>> So, the 'string tolower(string)' version has 2 cases, the first case where it doesn't need to modify the input and can simply return it, no problem. But case 2, where it does modify it, should dup and return char[].  My reasoning being that after it has completed and returned the copy, the caller now 'owns' the string (as it's the only copy in existence and no-one else has a reference to it).
>>>
>>
>> Indeed, I think this illustrates that some standard library functions may not have the correct signature, and tolower is likely one of them.
>> The most general case for tolower is:
>>   char[] tolower(const(char)[] s);
>> Since tolower creates a new array, but does not keep it, it can give away its ownership of the array (i.e., return a mutable array).
> 
> Sorry, but you seem to have missed a bit above: if the string doesn't contain any uppercase characters tolower returns the input without .dup-ing it (aka copy-on-write).
> 

Oops, sorry, that's right, I missed that part about tolower not
modifying the string if it wasn't necessary. :(


>> The second case, more specific, is simply syntactic sugar for making that array invariant:
>>
>>   invariant(char)[] tolowerinv(const(char)[] str) {
>>     return cast(invariant) tolower(str);
>>   }
> 
> Yes, but only if it actually needs to modify the string.
> 
> You seem to have missed that the two cases can't (in general) be distinguished at compile time; it's only at run time when a choice is made between a copy and no copy.
> 
>> The current signature:
>>   const(char)[] tolower(const(char)[] str)
>> is kinda incorrect, because it returns a const reference for an array that has no mutable references, and that is the same as an invariant reference, so tolower might as well return invariant(char)[].
> 
> Again, that only holds if a copy was actually made at run time. If no copy was made the original input is returned, to which there may be mutable references.

You're right, if a copy is not made *every* time (which is the case
after all), then the above doesn't hold.
But then, what I think is happening is that Phobos's current tolower is
suboptimal in terms of usefulness, because we don't know
whether a new copy is made or not. I'm wondering now what would be the more
useful form, or forms, of tolower (and similar functions) to have.
Now that I think of it again (admittedly I haven't got much experience with string manipulation in C++ or D), perhaps the best form is an in-place mutable version:
  char[] tolower(char[] str);
And it's this one after all that is the most general form. If you want to call tolower on a const or invariant array you dup it yourself on the call:
  char[] str = tolower("FOO".dup);
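
A rough sketch of that in-place form, assuming ASCII-only text and a current D compiler. The name lowerInPlace is hypothetical, chosen to avoid suggesting this is the Phobos function. The point it illustrates is that the caller decides when a copy is made.

```d
// Hypothetical in-place ASCII lowercase; modifies and returns its argument.
char[] lowerInPlace(char[] s)
{
    foreach (ref c; s)
    {
        if (c >= 'A' && c <= 'Z')
            c = cast(char)(c + ('a' - 'A'));
    }
    return s;
}

void main()
{
    // The caller owns a mutable buffer: no copy is needed.
    char[] owned = "Some PATH".dup;
    lowerInPlace(owned);
    assert(owned == "some path");

    // The caller holds a string: it pays for the copy explicitly.
    string fixedText = "FOO";
    char[] lowered = lowerInPlace(fixedText.dup);
    assert(lowered == "foo");
    assert(fixedText == "FOO");   // the original is untouched

    // lowerInPlace(fixedText);   // would not compile: string is not char[]
}
```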


-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Copyright © 1999-2021 by the D Language Foundation