June 28, 2012
On 6/28/12 8:57 AM, Roman D. Boiko wrote:
> On Thursday, 28 June 2012 at 12:29:14 UTC, Andrei Alexandrescu wrote:
>> On 6/28/12 5:58 AM, Roman D. Boiko wrote:
>> Pedantically speaking, sheer timings say nothing without the
>> appropriate baselines.
>>
>> Andrei
>
> I used benchmark results for the two such algorithms that I like most,
> taken from here:
>
> Vigna, S. (2008). "Broadword implementation of rank/select queries".
> Experimental Algorithms: 154–168.
>
> http://en.wikipedia.org/wiki/Succinct_data_structure#cite_ref-vigna2008broadword_6-0
>
>
> The numbers should be valid for C/C++ code executed on a machine that
> existed back in 2008. I'm not sure there is a good baseline to compare
> against. One option would be to benchmark random access to code points
> in a UTF-32 string. I also don't know of any D implementations of
> these algorithms, so I cannot predict how they would fare against
> dstring random access.
>
> But your statement that these timings say nothing is not fair: they
> can be used to conclude that this speed should be enough for most
> practical use cases, especially when those use cases are known.

Well of course I've exaggerated a bit. My point is that mentioning "200 ns!!!" sounds to the uninformed ear as good as "2000 ns" or "20 ns", i.e. "an amount of time so short by human scale, it must mean fast". You need to compare it against something, e.g. random access in an array.
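
For concreteness, a baseline can be as trivial as timing random access into a plain dchar array on the same machine. A rough sketch only; the sizes, iteration counts, and use of std.datetime's stopwatch module are my own assumptions:

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum N = 1_000_000;
    auto data = new dchar[N];       // UTF-32: genuinely O(1) indexing
    auto indices = new size_t[N];
    foreach (ref i; indices)
        i = uniform(0, N);          // random probe positions

    auto sw = StopWatch(AutoStart.yes);
    size_t sink;
    foreach (i; indices)
        sink += data[i];            // the baseline operation being timed
    sw.stop();

    // A rank/select timing only means something relative to this number,
    // measured on the same machine.
    writefln("%s ns per access (sink=%s)",
             sw.peek.total!"nsecs" / cast(double) N, sink);
}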


Andrei
June 28, 2012
On Thursday, 28 June 2012 at 14:34:03 UTC, Andrei Alexandrescu wrote:
> Well of course I've exaggerated a bit. My point is that mentioning "200 ns!!!" sounds to the uninformed ear as good as "2000 ns" or "20 ns", i.e. "an amount of time so short by human scale, it must mean fast". You need to compare it against something, e.g. random access in an array.
>
> Andrei

I have no benchmarks for plain array access on the same machine and compiler the authors used. However, it looks like at most two cache misses happen. If that is true, we can attribute roughly 100 ns to each memory access plus computation. I would claim that most of that time goes to memory access, since the same algorithms take 35-50 ns on smaller arrays (up to 4 Mbit, which is about 512 KB), but I'm not sure my conclusions are definitely true.

Also, I made a mistake in another post: I should have said that it is possible to address arrays of up to 2^64 code units, but the benchmarks are given for data sizes in bits (i.e., up to 1 Gbit).

Asymptotically, the algorithms should require relatively less space overhead for bigger arrays: the space complexity of the index is O(N / log N). But memory address resolution may become slower. This is true both for the Rank/Select algorithms and for raw array access.

Again, please note that the price is paid only once, per code unit resolution (for Select) or per code point calculation (for Rank). Subsequent nearby accesses should be very cheap.
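
To make the Select direction concrete, here is a toy sketch of the access pattern over UTF-8. It uses a plain sampled index rather than the broadword structure from the paper, and the sampling rate is an arbitrary choice of mine:

import std.utf : stride;

struct CodePointIndex
{
    private immutable(char)[] str;
    private size_t[] samples;  // code unit offset of every 64th code point
    enum sampleEvery = 64;     // arbitrary sampling rate

    this(immutable(char)[] s)
    {
        str = s;
        size_t i = 0, cp = 0;
        while (i < s.length)
        {
            if (cp % sampleEvery == 0)
                samples ~= i;
            i += stride(s, i);  // width of the code point starting at i
            ++cp;
        }
    }

    // select(n): code unit offset of the n-th code point.  One table
    // lookup plus at most sampleEvery - 1 forward strides; the price is
    // paid once, and nearby accesses afterwards are cheap.
    size_t select(size_t n) const
    {
        size_t i = samples[n / sampleEvery];
        foreach (_; 0 .. n % sampleEvery)
            i += stride(str, i);
        return i;
    }
}

Rank is the inverse mapping (code unit offset to code point number) and can be sampled the same way; the structures from the paper achieve the same effect within the O(N / log N) space bound mentioned above.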
June 28, 2012
My point (and the reason I somewhat hijacked this thread) is that such functionality would be useful for random access over narrow strings. Currently, random access is missing.

Also, this approach fits nicely when random access is needed to Unicode characters (graphemes), not just code points.

I don't see much practical value in the proposals to reconsider the current string implementation. Arguments have been presented that this would improve consistency, but as Andrei replied, consistency itself has many dimensions. The proposed changes have not been described in detail, but from what I understood, they would make common use cases more verbose.

In contrast, I do see value in Select / Rank indexing. Does anybody agree?


August 23, 2012
Sorry to resurrect this thread; I've been very absent from D, and am just now going through all these old posts.

On Wed, 27 Jun 2012 17:41:14 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:

>> No, druntime, and include minimal utf support. We do the same thing with
>> AssociativeArray.
>>
>
> In this case it is misleading to call it a library type.

What I mean is, the compiler does not define its structure.  It simply knows the type exists and expects a certain API from it.

The type itself is defined purely in the library, and could possibly be used directly as a library type.

>>>> If you want immutable(char)[], use "abc".codeunits or equivalent.
>>>>
>>>
>>> I really don't want to type .codeunits, but I want to use
>>> immutable(char)[] everywhere. This 'library type' is just an interface
>>> change that makes writing nice and efficient code a kludge.
>>
>> When most string functions take strings, why would you want to use
>> immutable(char)[] everywhere?
>>
>
> Because the proposed 'string' interface is inconvenient to use and
> useless. It is a struct with one data member and no additionally
> maintained invariant, and it strictly narrows which essential parts of
> the interface are reachable without a large typing overhead.
> immutable(char)[] supports exactly the operations I usually need.
> Maybe I'm not representative.

Most uses of strings are to concatenate them, print them, use them as keys, read them from a stream, etc.  None of this requires direct access to the data; strings can be treated as a nebulous type.

So maybe you are in the minority.  I don't really know.
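
Just so we are arguing about the same thing, the kind of wrapper under discussion would look roughly like this (a sketch only -- .codeunits is the name used earlier in the thread, not code from an actual proposal):

struct String
{
    private immutable(char)[] data;  // the single member Timon mentions

    // escape hatch to the raw code units
    @property immutable(char)[] codeunits() const { return data; }

    // the "nebulous" operations that cover most usage:
    String opBinary(string op : "~")(String rhs) const
    {
        return String(data ~ rhs.data);  // concatenation
    }
    string toString() const { return data; }  // printing, AA keys, etc.
}

unittest
{
    auto s = String("ab") ~ String("cd");
    assert(s.codeunits == "abcd");
}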

>>>> The current situation is not simple to understand.
>>>
>>> It is simple, even if not immediately obvious. It does not have to be
>>> immediately obvious without explanation. It needs to be convenient.

I will respond to this in a different way.  This is the confusing part:

import std.range : hasLength, isRandomAccessRange;

string x = "abc";
static assert(!hasLength!string);            // Phobos: strings have no length
static assert(!isRandomAccessRange!string);  // ...and no random access
auto len = x.length;                         // yet this compiles
auto c = x[0];                               // and so does this -- wtf?

It is *always* going to cause people to question isRandomAccessRange and hasLength because clearly a string has the purported properties needed to satisfy both.
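
And the other half of the picture is what the range primitives report for the very same type (a self-contained sketch; the decoding front/popFront overloads for narrow strings live in std.array):

import std.array : front, popFront;
import std.range : ElementType;

void main()
{
    string x = "héllo";
    static assert(is(typeof(x[0]) == immutable(char))); // array view: code unit
    static assert(is(ElementType!string == dchar));     // range view: code point
    x.popFront();            // pops one *code point* ('h'), not one byte
    assert(x.front == 'é');  // front decodes the two-byte UTF-8 sequence
    assert(x[0] != 'é');     // but x[0] is just the first byte of it
}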

>>>> Generic code that accepts arrays has to special-case narrow-width
>>>> strings if you plan to
>>>> use phobos with them in some cases. That is a horrible situation.
>>>>
>>>
>>> Generic code accepts ranges, not arrays. All necessary (or maybe
>>> unnecessary, I don't know) special casing is already done for you in
>>> Phobos. The _only_ thing that is problematic is the inconsistent
>>> 'foreach' behaviour.
>>
>> Plenty of generic code specializes on arrays.
>>
>
> Ok, point taken. But plenty of generic code then specializes on
> strings as well. Would the net gain be so huge? There is also always
> the option of just not passing strings to some helper template function
> you defined.

The net gain is not in the reduction of specializations -- of course we will need specializations for strings because to do any less would make D extremely inefficient compared to other languages.

The gain is in the reduction of confusion.  We are asking our users "I know, I know, it's an array, but *please* pretend it's not! Don't listen to the compiler!"

And this is for a *BASIC* type of the language!

range.save() is the same thing.  It prevents nothing, and you have to take special care to avoid using basic operations (i.e. assignment) and pretend they don't exist, even though they compile.  To me, that is worthless.  If a string does not support random access, then str[0] should not compile.  Period.
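
To spell out the save() analogy (a trivial sketch):

import std.array : save;

void example()
{
    auto r = [1, 2, 3];
    auto official = r.save; // the copy generic code is told to make
    auto sneaky = r;        // compiles just as well and does the same
                            // thing here -- save() prevents nothing
    assert(official == sneaky);
}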

> You are right about the random-access part, but the definition of an
> array does not depend on the 'range' concept.

The range concept is notably more confusing with strings: they are random-access types and support every property a random-access range needs, yet they are not random-access ranges.

-Steve