Thread overview
Copy-on-Write using toupper() source example?
Mar 16, 2005
AEon
Mar 16, 2005
AEon
Mar 17, 2005
Ilya Minkov
Mar 25, 2005
Georg Wrede
Mar 25, 2005
AEon
March 16, 2005
To start off this newsgroup:

Section: Strings (and Array) Copy-on-Write (http://www.digitalmars.com/d/memory.html)

Trying to fully understand the toupper() example. Most of the questions below are more rhetorical, but it would be helpful to have my "understanding" verified.


Copy-on-write version

<code>
   char[] toupper(char[] s)
   {
       int changed = 0;
       for (int i = 0; i < s.length; i++)
       {
           char c = s[i];
           if ('a' <= c && c <= 'z')
           {
               if (!changed)
               {
                   char[] r = new char[s.length];
                   r[] = s;
                   s = r;
                   changed = 1;
               }
               s[i] = c - (cast(char)'a' - 'A');
           }
       }
       return s;
   }
</code>


1. If I understand this corretly: Should a lower case letter be encountered (s would need to be changed), a new array r is allocated in memory, the contents of s it copied into it, and then s points to the copy, thus not changing the original "char[] s" that was handed to the function?

If so, ok, so you do not mess up your original values.


2. Does one really need to use the "new" command?

   char[] r = new char[s.length];  // non-local array
   r[] = s;

would not the following be equivalent?

   char[s.length] r;               // local array, would be freed aber leavin
function
   r = s.dup;

This is more retorical but makes me wonder. "new" is required since you need to allocate non-local memory?


3. I am only beginning to learn about the handling of strings, but how come

   r[] = s;

is used, would not

   r = s;

do the same? Or is this (r[] = s;) required since r is a "new" created array? Or is "r[] =" a notation for data copy?


4. As I understand:

   char[] r = new char[s.length];
   r[] = s;
   s = r;

- Create a new "global" dynamic array r with the length of s. - Copy the contents (data) of s to the new r array. - Point s (reference?) to r? Is that even legal?

s would then point to another part of memory?

Would r not circumvent garbage collection?

When would the memory for r be freed? Would one need to do that manually?

I am just imagining, this function is called thousands of times, and all those "r" arrays clogging up memory? In ANSI C I would have defined one global temp (scratch) variable, that is used in all such cases.


Thanx.

AEon
March 16, 2005
"AEon" <AEon_member@pathlink.com> wrote in message news:d1a5cj$22pp$1@digitaldaemon.com...
> 1. If I understand this corretly: Should a lower case letter be
> encountered (s
> would need to be changed), a new array r is allocated in memory, the
> contents of
> s it copied into it, and then s points to the copy, thus not changing the
> original "char[] s" that was handed to the function?
> If so, ok, so you do not mess up your original values.

Yep.

> 2. Does one really need to use the "new" command?
>   char[] r = new char[s.length];  // non-local array
>   r[] = s;

Yep.

> would not the following be equivalent?
>   char[s.length] r;               // local array, would be freed aber
> leavin
> function
>   r = s.dup;

Writing char[s.length] causes an error, for some reason.  Hence, the "new" expression in the original function.

However, you can write

char[] r;
r=s.dup;

As .dup duplicates the entire array, length data and all.  So this would be equivalent to the original.

> 3. I am only beginning to learn about the handling of strings, but how come
>
>   r[] = s;
>
> is used, would not
>
>   r = s;
>
> do the same? Or is this (r[] = s;) required since r is a "new" created
> array? Or
> is "r[] =" a notation for data copy?

You got it.  r=s would make r point to the same array as s.  r[]=s copies s's data into r's memory.

> 4. As I understand:
>
>   char[] r = new char[s.length];
>   r[] = s;
>   s = r;
>
> - Create a new "global" dynamic array r with the length of s.
> - Copy the contents (data) of s to the new r array.
> - Point s (reference?) to r? Is that even legal?

Oh of course.

> s would then point to another part of memory?

s points to the same place r points.  Arrays are pretty much just structs in D - they have a length and a pointer to their data.  When you write "s=r", you're only copying those two things - the length and the pointer - to s.

> Would r not circumvent garbage collection?
> When would the memory for r be freed? Would one need to do that manually?

Hopefully r's memory is not freed, as that is what is returned from the function.

> I am just imagining, this function is called thousands of times, and all
> those
> "r" arrays clogging up memory? In ANSI C I would have defined one global
> temp
> (scratch) variable, that is used in all such cases.

Actually it has to do with what you do with r once it's returned from the function.  If you do something like this:

char[][] strings;
// .. strings is filled somehow
for(int i=0; i<strings.length; i++)
{
    strings[i]=toUpper(strings[i]);
}

Here's what happens.

strings[i] points to some place in memory.  You pass it into toUpper, where it is now called s.  Say it does have to be copied.  What is returned is a copy of s, a.k.a. strings[i].  So strings[i] is so far unaffected.  However, the statement in the for() loop assigns the result of the function back into strings[i].  So strings[i] now points to the copy of strings[i], and what used to be pointed to by strings[i] is released by the GC.  So the new, uppercase string is created and the old string is destroyed - no memory loss.

I've found that it's a lot easier not to try to understand the GC, but to take advantage of it ;)  When you're not sure about some memory management issue, remember - because of the GC, it really IS that easy.


March 16, 2005
Jarrett Billingsley wrote...

Thanx for the detailed info. I am glad I understood most the things properly.


>> 2. Does one really need to use the "new" command?
>>   char[] r = new char[s.length];
>>   r[] = s;

>However, you can write
>    char[] r;
>    r=s.dup;

Should have occurred to me (i.e. dropping the explicit sizing "s.length"). But it does surprise me that the above 2 lines are equivalent to the "new" version. I would have sworn that the latter is just a local array declaration that is lost as soon as you leave the function.


Would a reduction to

    char[] r=s.dup;

also work?


>Actually it has to do with what you do with r once it's returned from the function.  If you do something like this:
>
>char[][] strings;
>// .. strings is filled somehow
>for(int i=0; i<strings.length; i++)
>{
>    strings[i]=toUpper(strings[i]);
>}


>Here's what happens.
>
>strings[i] points to some place in memory.  You pass it into toUpper, where it is now called s.  Say it does have to be copied.  What is returned is a copy of s, a.k.a. strings[i].  So strings[i] is so far unaffected.  However, the statement in the for() loop assigns the result of the function back into strings[i].

>So strings[i] now points to the copy of strings[i], and what used to be pointed to by strings[i] is released by the GC.  So the new, uppercase string is created and the old string is destroyed - no memory loss.

Hopefully :)... but I see what you mean. Coming from ANSI C, I am very suspicious of any memory management, since if you did not do the right thing back then your tools would leak memory (or do worse things). If I am not mistaken, in ANSI C, the example above would indeed lead to a leak, since you have overwritten the original pointer to string[i], thus no longer are able to access that original piece of allocated memory?


>I've found that it's a lot easier not to try to understand the GC, but to take advantage of it ;)  When you're not sure about some memory management issue, remember - because of the GC, it really IS that easy.

:)... I guess I just need to start recoding my stats in D, and then after seeing my code *not* crash... then I will sit back and enjoy GC ;)

BTW, when I first looked at C++, hoping such things would be done for me, I was badly disappointed that all the things I missed in C where not present in C++, and many things actually got worse. I just can't believe how great D is... sigh... *happiness* :)

AEon
March 17, 2005
AEon wrote:
> Should have occurred to me (i.e. dropping the explicit sizing "s.length"). But it does surprise me that the above 2 lines are equivalent to the "new" version. I would have sworn that the latter is just a local array declaration that is lost as soon as you leave the function.

First rule about swearing: don't. ;>

type[length] array1; //fixed-length stack array
type[] array2: //dynamic array.

In the case of dynaic array, a structure of a pointer to the beginning of the array slice and a length of this slice is placed right where you declare it, i.e. here in a local variable space, on a stack.

Then you allocate a dynamic array and assign it to this local variable, this variable can be copied around, returned, etc.

This is in contrast to a static array. When passing and under other circumstances, such stuff becomes converted into a pointer to the memory region (+ optionally length), which is soon no longer valid because the memory region lies on the stack frame which will already be destroyed after you return from the function. Exactly like in C.

> Would a reduction to
> 
>     char[] r=s.dup;
> 
> also work?

If i'm not missing anything out, yes.

However, it was probably decided against .dup because it makes one extra step: namely it populates the new array with a copy of an old one. I wouldn't usually care about that though, this is among things an optimizer would handle someday.

>>Actually it has to do with what you do with r once it's returned from the function.  If you do something like this:
>>
>>char[][] strings;
>>// .. strings is filled somehow
>>for(int i=0; i<strings.length; i++)
>>{
>>   strings[i]=toUpper(strings[i]);
>>}
> 
> 
> 
>>Here's what happens.
>>
>>strings[i] points to some place in memory.  You pass it into toUpper, where it is now called s.  Say it does have to be copied.  What is returned is a copy of s, a.k.a. strings[i].  So strings[i] is so far unaffected.  However, the statement in the for() loop assigns the result of the function back into strings[i].  
> 
> 
>>So strings[i] now points to the copy of strings[i], and what used to be pointed to by strings[i] is released by the GC.  So the new, uppercase string is created and the old string is destroyed - no memory loss.
> 
> 
> Hopefully :)... but I see what you mean. Coming from ANSI C, I am very suspicious of any memory management, since if you did not do the right thing back then your tools would leak memory (or do worse things). If I am not mistaken, in ANSI C, the example above would indeed lead to a leak, since you have overwritten the original pointer to string[i], thus no longer are able to access that original piece of allocated memory?

Generally, not, but usually and practically, yes. Function toUpper gets a string slice stucture as argument by value, i.e. a copy of in its local stack. It reassigns this internal copy to a new value. Then it returns it.

The calling function would then be responsible to keep an old pointer (actually slice structure in D), compare it with the new one, and if the old one differs and it doesn't need it any longer, free it. However, that's quite a cumbersome way of doing business, so usually and practically implementing a function this way is likely to make an error.

> :)... I guess I just need to start recoding my stats in D, and then after seeing my code *not* crash... then I will sit back and enjoy GC ;)

To sit back and enjoy means to nearly abandon delete. Only use explicit memory management on well-encapsulated resources.

> BTW, when I first looked at C++, hoping such things would be done for me, I was badly disappointed that all the things I missed in C where not present in C++, and many things actually got worse. I just can't believe how great D is... sigh... *happiness* :)

C++ doesn't have language means for nearly anything, it is only sensible in conjunction with libraries. For example, std::string has value semantics - thus you hardly ever need to manage pointers to it. Firthermore, it is usually implemented as a reference-counted pointer with copy-on-write semantics: not particularly performance-gaining, but memory-efficient. For passing pointers up and down the stack with proper clean-up one uses auto_ptr<>, and for a general case of memory management shared_ptr<>, which is almost analogous to D's GC - though slower. All but shared_ptr are in the standard library, shared_prt is in boost, and will be probably integrated in the standard library later. Very nice are also containers, including an easily growing array replacement, vector<>. Or well, at least /me and many other people find prgramming in C++ with the right libraries mildly enjoyable, but not in C.

-eye
March 25, 2005
AEon wrote:
> :)... I guess I just need to start recoding my stats in D, and then after seeing my code *not* crash... then I will sit back and enjoy GC ;)

I guess that sums up something every GC-phobic ought to read the first day! (Even I.)

> BTW, when I first looked at C++, hoping such things would be done for me, I was badly disappointed that all the things I missed in C where not present in C++, and many things actually got worse. I just can't believe how great D is... sigh... *happiness* :)

Man, amen to that!

Walter should really start collecting these statements! And put them on the D startup page!!!!

Hmm. Or maybe we should do it for him?  :-)
March 25, 2005
Georg Wrede wrote:
> AEon wrote:
> 
>> :)... I guess I just need to start recoding my stats in D, and then after seeing my code *not* crash... then I will sit back and enjoy GC ;)
> 
> I guess that sums up something every GC-phobic ought to read the first day! (Even I.)

Well I would not call myself a GC-phobic. It's more like I never "found" any language that would have been fast enough for me to really appreciate GC.  This has probably to do with the fact (IIRC) that most of the time GC is used by interpreter languages like Basic, many Script languages (ARexx, Perl, PHP) and all those are just plain slow, compared to something like C.

>> BTW, when I first looked at C++, hoping such things would be done for me, I was badly disappointed that all the things I missed in C where not present in C++, and many things actually got worse. I just can't believe how great D is... sigh... *happiness* :)
> 
> Man, amen to that!
> 
> Walter should really start collecting these statements! And put them on the D startup page!!!!
> 
> Hmm. Or maybe we should do it for him?  :-)

:)...

Presently I have the feeling the std.string lib is actually more powerful and more useful then the libs PHP uses for strings. Every time I was looking for a command to do something slightly out of the ordinary, a std.string command did exacty what I had hoped, this IIRC was not the case for PHP. If I am not mistaken, thsi would mean D is beating PHP on it's own turf :)

AEon