string comparison (page 2) - D Programming Language Discussion Forum

December 20, 2010

Re: string comparison

Posted by Steven Schveighoffer
in reply to Stanislav Blinov

On Mon, 20 Dec 2010 11:13:34 -0500, Stanislav Blinov <blinov@loniir.ru> wrote:

> And lastly, hasn't this by chance been your first post? AFAIR, the first message is being moderated so it doesn't get to the public at once.

BTW, this message board is not moderated.

-Steve

December 20, 2010

Re: string comparison

Posted by Steven Schveighoffer
in reply to Steven Schveighoffer

On Mon, 20 Dec 2010 14:05:56 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Mon, 20 Dec 2010 11:13:34 -0500, Stanislav Blinov <blinov@loniir.ru> wrote:
>
>> And lastly, hasn't this by chance been your first post? AFAIR, the first message is being moderated so it doesn't get to the public at once.
>
> BTW, this message board is not moderated.

I should clarify, it's retroactively moderated :)  That is, if spam appears, it's allowed to go through, but then removed once discovered.

-Steve

December 20, 2010

Re: string comparison

Posted by doubleagent
in reply to Jonathan M Davis

> The reason that std.string.splitter() does not show in the documentation is that its return type is auto, and there is currently a bug in ddoc that makes it so that auto functions don't end up in the generated documentation. Looking at the code, it pretty much just forwards to std.algorithm.splitter() using whitespace as its separator, so you can look at the documentation there if you'd like.

Thanks.  The code was pretty self-explanatory but it's helpful to know that auto functions currently don't get documented.
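
For reference, a minimal sketch of the lazy and eager forms side by side. It assumes a current Phobos and therefore imports splitter from std.algorithm directly rather than going through std.string; the sample sentence is made up for illustration.

```d
import std.algorithm : equal, splitter;
import std.array : split;

void main()
{
    string line = "the quick brown fox";   // illustrative input

    // splitter() returns a lazy range: each word is a slice of `line`,
    // produced only when the range is iterated.
    auto lazyWords = splitter(line);
    assert(equal(lazyWords, ["the", "quick", "brown", "fox"]));

    // split() is eager: it allocates an array that holds all the slices.
    string[] eagerWords = split(line);
    assert(eagerWords == ["the", "quick", "brown", "fox"]);
}
```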

December 20, 2010

Re: string comparison

Posted by Jonathan M Davis
in reply to doubleagent

On Monday, December 20, 2010 10:44:12 doubleagent wrote:
> > Are you 100% sure that you are running this version
> 
> I have to be.  There are no other versions of phobos on this box and 'which dmd' points to the correct binary.
> 
> >  dictionary[word.idup] = newId;
> 
> That fixes it.
> 
> > The 'word' array is mutable and reused by byLine() on each iteration.  By doing the above you use an immutable copy of it as the key instead.
> 
> I REALLY don't understand this explanation.  Why does the mutability of 'word' matter when the associative array 'dictionary' assigns keys by value...it's got to assign them by value, right?  Otherwise we would only get one entry in 'dictionary' and the key would be constantly changing.

Okay. I don't know what the actual code looks like, but word is obviously a dynamic array, and if it's from byLine(), then that dynamic array is mutable - both the array itself and its elements. Using idup gets you an immutable copy. When copying dynamic arrays, you really get a slice of that array. So, you get an array that points to the same data as the original. Any changes to the elements in one affect the other. If you append to one of them and it doesn't have the space to resize in place, or you do anything else which could cause it to reallocate, then that array is reallocated, they no longer point to the same data, and changing one will not change the other.

If the elements of the array are const or immutable, then the fact that the two arrays point to the same data isn't a problem because the elements can't be changed (except in cases where you're dealing with const rather than immutable and another array points to the same data but doesn't have const elements). So, assigning one string to another, for instance (string being an alias for immutable(char)[]), will never result in one string altering another. However, if you're dealing with char[] rather than string, one array _can_ affect the elements of another. I believe that byLine() deals with a char[], not a string.
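
A minimal sketch of that aliasing, assuming nothing beyond the built-in array properties dup and idup; the variable names are made up for illustration.

```d
void main()
{
    char[] a = "hello".dup;   // mutable copy of the literal
    char[] b = a;             // copies pointer and length only: same elements

    b[0] = 'j';               // the write through b is visible through a
    assert(a == "jello");

    string s = a.idup;        // immutable copy with its own memory
    a[0] = 'c';               // later writes to a cannot reach s
    assert(a == "cello");
    assert(s == "jello");
}
```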

Now, as for associative arrays, they don't really deal with const correctly. I believe that they're actually implemented with void* and you can actually do things like put const elements in them in spite of the fact that toHash() on Object is not currently const (there is an open bug on the fact that Object is not const-correct). So, it does not surprise me in the least that they will take mutable types as keys and then allow them to be altered (assuming that they're pointers or reference types and you can therefore have other references to them). But to fix the problem in this case would require immutability rather than const, because you're dealing with a reference type (well, a pseudo-reference type, since dynamic arrays share their elements such that changes to their elements affect all arrays which point to those elements, but other changes - such as altering their length - don't affect other arrays and will even likely result in the arrays then being completely separate).

> The behavior itself seems really unpredictable prior to testing, and really unintended after testing.  I suspect it's due to some sort of a bug.  The program, on my box anyway, only fails when we give it identical strings, except one is prefixed with a space.  That should tell us that 'splitter' and 'strip' didn't do their job properly.  The fly in the ointment is that when we output the strings, they appear as we would expect.
> 
> I suspect D does string comparisons (when the 'in' keyword is used) based
> on some kind of a hash, and that hash doesn't get correctly updated when
> 'strip' or 'splitter' is applied, or upon the next comparison or whatever.
>  Calling 'idup' must force the hash to get recalculated.  Obviously, you
> guys would know if there's any merit to this, but it seems to explain the
> problem.

'in' should use toHash() (or whatever built-in functions for built-in types if you're not dealing with a struct or class) followed by ==. I'd be stunned if there were any caching involved. The problem is that byLine() is using a mutable array, so the elements pointed to by the array that you just put in the associative array changed, which means that the hash for them is wrong, and == will fail when used to compare the array to what it was before.
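
A sketch of the idup fix under a few assumptions: addWords is a hypothetical helper, a hand-written char[] buffer stands in for the one byLine() reuses, and the ids are size_t rather than uint so that dictionary.length fits on 64-bit targets.

```d
import std.algorithm : splitter;
import std.string : strip;

// Hypothetical helper: gives each new word in `line` the next free id.
void addWords(ref size_t[string] dictionary, const(char)[] line)
{
    foreach (word; splitter(strip(line)))
    {
        if (word in dictionary) continue;   // lookup works on the slice itself
        auto newId = dictionary.length;
        dictionary[word.idup] = newId;      // the key is a private immutable copy
    }
}

void main()
{
    size_t[string] dictionary;
    char[] buffer = "one two".dup;          // stand-in for byLine()'s buffer
    addWords(dictionary, buffer);

    buffer[] = "six two";                   // overwrite the same memory in place
    addWords(dictionary, buffer);

    // The stored keys did not change when the buffer was overwritten.
    assert(dictionary.length == 3);
    assert(dictionary["one"] == 0);
    assert(dictionary["two"] == 1);
    assert(dictionary["six"] == 2);
}
```

Only words that are actually inserted get copied, so the lookup path stays allocation-free.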

> > The advantage with splitter is that it is lazy and therefore more efficient.  split() is eager and allocates memory to hold the string fragments.
> 
> Yeah, that's what I thought would be the answer.  Kudos to you guys for thinking of laziness out of the box.  This is a major boon for D.
> 
> You know, there's something this touches on which I was curious about.  If D defaults to 'safety first', and with some work you can get down-to-the-metal, why doesn't the language default to immutable variables, with an explicit modifier for mutable ones?  C compatibility?

C compatibility would be one reason. Familiarity would be another. Also, it would be _really_ annoying to have to mark variables mutable all over the place as you would inevitably have to do. The way that const and immutable are designed in D, to some extent, you can pretty much ignore them if you don't want to use them, which some folks like Andrei deem important. Making immutable the default would force it on everyone.

- Jonathan M Davis

December 20, 2010

Re: string comparison

Posted by Lars T. Kyllingstad
in reply to doubleagent

On Mon, 20 Dec 2010 18:44:12 +0000, doubleagent wrote:

>> Are you 100% sure that you are running this version
> 
> I have to be.  There are no other versions of phobos on this box and 'which dmd' points to the correct binary.
> 
>>  dictionary[word.idup] = newId;
> 
> That fixes it.
> 
>> The 'word' array is mutable and reused by byLine() on each iteration. By doing the above you use an immutable copy of it as the key instead.
> 
> I REALLY don't understand this explanation.  Why does the mutability of 'word' matter when the associative array 'dictionary' assigns keys by value...it's got to assign them by value, right?  Otherwise we would only get one entry in 'dictionary' and the key would be constantly changing.

This could be related to bug 2954, for which a fix will be released in the next version of DMD.

  http://d.puremagic.com/issues/show_bug.cgi?id=2954

-Lars

December 21, 2010

Re: string comparison

Posted by doubleagent
in reply to Jonathan M Davis

> Okay. I don't know what the actual code looks like

Here.

import std.stdio, std.string;

void main() {
        uint[string] dictionary; // v[k], so string->uint
        foreach (line; stdin.byLine()) {
                // break sentence into words
                // Add each word in the sentence to the vocabulary
                foreach (word; splitter(strip(line))) {
                        if (word in dictionary) continue; // nothing to do
                        auto newId = dictionary.length;
                        dictionary[word] = newId;
                        writefln("%s\t%s", newId, word);
                }
        }
}

> ...

Okay, suppose you're right.  The behavior is still incorrect because the associative array has allowed two identical keys...identical because the only difference between two strings which I care about are the contents of their character arrays.

> Also, it
> would be _really_ annoying to have to mark variables mutable all over the place
> as you would inevitably have to do.

Obviously your other points are valid, but I haven't found this to be true (Clojure is pure joy).  Maybe you're right because D is a systems language and mutability needs to be preferred, however after only a day or two of exposure to this language that assumption also appears to be wrong.  Take a look at Walter's first attempted patch to bug 2954: 13 lines altered to explicitly include immutable, and 4 altered to treat variables as const: http://www.dsource.org/projects/dmd/changeset/749

But I'm willing to admit that my exposure is limited, and that particular example is a little biased.

December 21, 2010

Re: string comparison

Posted by doubleagent
in reply to Lars T. Kyllingstad

> This could be related to bug 2954, for which a fix will be released in the next version of DMD.

Looking at that new descriptive error message ie error("associative arrays can only be assigned values with immutable keys, not %s", e2->type->toChars());  it appears to be a distinct possibility.  Thanks.

December 21, 2010

Re: string comparison

Posted by Jonathan M Davis
in reply to doubleagent

On Monday, December 20, 2010 16:45:20 doubleagent wrote:
> > Okay. I don't know what the actual code looks like
> 
> Here.
> 
> import std.stdio, std.string;
> 
> void main() {
>         uint[string] dictionary; // v[k], so string->uint
>         foreach (line; stdin.byLine()) {
>                 // break sentence into words
>                 // Add each word in the sentence to the vocabulary
>                 foreach (word; splitter(strip(line))) {
>                         if (word in dictionary) continue; // nothing to do
>                         auto newId = dictionary.length;
>                         dictionary[word] = newId;
>                         writefln("%s\t%s", newId, word);
>                 }
>         }
> }
> 
> > ...
> 
> Okay, suppose you're right.  The behavior is still incorrect because the associative array has allowed two identical keys...identical because the only difference between two strings which I care about are the contents of their character arrays.

Array comparison cares about the contents of the array. It may shortcut comparisons if lengths differ or if they point to the same point in memory and have the same length, but array comparison is all about comparing their elements.

In this case, you'd have two arrays/strings which point to the same point in memory but have different lengths. Because their lengths differ, they'd be deemed unequal. If you managed to try and put a string in the associative array which has the same length as one that you already inserted, then they'll be considered equal, since their lengths are identical and they point to the same point in memory, so in that case, I would expect the original value to be replaced with the new one. But other than that, the keys will be considered unequal in spite of the fact that they point to the same place in memory.
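
A small sketch of those comparison rules, using made-up slices of a single buffer plus one separate copy:

```d
void main()
{
    char[] buffer = "hello world".dup;
    char[] shortKey = buffer[0 .. 5];    // "hello"
    char[] longKey  = buffer[0 .. 11];   // "hello world"

    // Same starting address, different lengths: the arrays are unequal.
    assert(shortKey.ptr is longKey.ptr);
    assert(shortKey != longKey);

    // Different memory, same contents: the arrays are equal.
    char[] other = "hello".dup;
    assert(other.ptr !is shortKey.ptr);
    assert(other == shortKey);
}
```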

The real problem here is that associative arrays currently allow non-immutable keys. Once that's fixed, then it won't be a problem anymore.

> > Also, it
> > would be _really_ annoying to have to mark variables mutable all over the
> > place as you would inevitably have to do.
> 
> Obviously your other points are valid, but I haven't found this to be true (Clojure is pure joy).  Maybe you're right because D is a systems language and mutability needs to be preferred, however after only a day or two of exposure to this language that assumption also appears to be wrong.  Take a look at Walter's first attempted patch to bug 2954: 13 lines altered to explicitly include immutable, and 4 altered to treat variables as const: http://www.dsource.org/projects/dmd/changeset/749
> 
> But I'm willing to admit that my exposure is limited, and that particular example is a little biased.

Most programmers don't use const even in languages that have it. And with many programmers programming primarily in languages like Java or C# which don't really have const (IIRC, C# has more of a const than Java, but it's still pretty limited), many, many programmers never use const and see no value in it. So, for most programmers, mutable variables will be the norm, and they'll likely only use const or immutable if they have to. There are plenty of C++ programmers who will seek to use const (and possibly immutable) heavily, but they're definitely not the norm. And, of course, there are plenty of other languages out there with const or immutable types of one sort or another (particularly most functional languages), but those aren't the types of languages that most programmers use. The result is that most beginning D programmers will be looking for mutable to be the norm, and forcing const and/or immutable on them could be seriously off-putting.

Now, most code which is going to actually use const and immutable is likely to be a fair mix of mutable, const, and immutable - especially if you don't try to make everything immutable at the cost of efficiency like you'd typically get in a functional language. That being the case, regardless of whether mutable, const, or immutable is the default, you're going to have to mark a fair number of variables as something other than the default. So, making const or immutable the default would likely not save any typing, and it would annoy a _lot_ of programmers.

So, the overall gain of making const or immutable the default is pretty minimal if not outright negative.

Personally, I use const and immutable a lot, but I still wouldn't want const or immutable to be the default.

- Jonathan M Davis

December 21, 2010

Re: string comparison

Posted by doubleagent
in reply to Jonathan M Davis

Good & I agree.

December 21, 2010

Re: string comparison

Posted by Stanislav Blinov
in reply to Steven Schveighoffer

20.12.2010 22:06, Steven Schveighoffer wrote:
> On Mon, 20 Dec 2010 14:05:56 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>
>> On Mon, 20 Dec 2010 11:13:34 -0500, Stanislav Blinov <blinov@loniir.ru> wrote:
>>
>>> And lastly, hasn't this by chance been your first post? AFAIR, the first message is being moderated so it doesn't get to the public at once.
>>
>> BTW, this message board is not moderated.
>
> I should clarify, it's retroactively moderated :)  That is, if spam appears, it's allowed to go through, but then removed once discovered.
>
Citing what I got when first posted from my current address:

...
Your mail to 'Digitalmars-d-learn' with the subject

    Re: hijacking a class's members

Is being held until the list moderator can review it for approval.

The reason it is being held:

    Post to moderated list

Either the message will get posted to the list, or you will receive
notification of the moderator's decision.  If you would like to cancel
this posting, please visit the following URL:
...

I should note that I post via mailing list, so maybe this is the catch.
