June 28, 2012
On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
> Pedantically speaking, it is possible to index a string with about 50-51% memory overhead to get random access in O(1) time. Best-performing algorithms can do random access in about 35-50 nanoseconds per operation for strings up to tens of megabytes. For bigger strings (tested up to 1GB) or when some other memory-intensive calculations are performed simultaneously, random access takes up to 200 nanoseconds due to the memory-access resolution process.

Just a remark: indexing would take O(N) operations and N/B memory transfers, where N = str.length and B is the size of a cache block.
June 28, 2012
On Thursday, 28 June 2012 at 10:02:59 UTC, Roman D. Boiko wrote:
> On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
>> Pedantically speaking, it is possible to index a string with about 50-51% memory overhead to get random access in O(1) time. Best-performing algorithms can do random access in about 35-50 nanoseconds per operation for strings up to tens of megabytes. For bigger strings (tested up to 1GB) or when some other memory-intensive calculations are performed simultaneously, random access takes up to 200 nanoseconds due to the memory-access resolution process.
>
> Just a remark: indexing would take O(N) operations and N/B memory transfers, where N = str.length and B is the size of a cache block.

That being said, I would be against switching away from representing strings as arrays. Such a switch would hardly help us solve any problem of practical importance significantly better than it can be solved with the current design.

However, a struct for indexing, which I mentioned in two previous posts, could be created to give efficient random access for narrow strings (and for arbitrary variable-length data stored contiguously in arrays) without any significant overhead.

The relevant algorithms are called Rank and Select, and many variations of them exist (with different trade-offs, though some are arguably better than others).

I have investigated this question quite deeply over the last two weeks, because similar algorithms would be useful in my DCT project. If nobody else implements them first, I will eventually do so myself. It is just a matter of finding some free time, likely a week or two.
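
To make this concrete, here is a rough sketch of what such a struct could look like. This is not one of the rank/select structures from the literature, just a plain sampled index (storing the code unit offset of every K-th code point), and all names are made up:

import std.utf : decode, stride;

// Sketch only: a sampled index over a UTF-8 string.
// samples[i] holds the code unit offset of code point i * K.
struct CodePointIndex(size_t K = 64)
{
    string str;
    size_t[] samples;

    this(string s)
    {
        str = s;
        size_t i = 0, cp = 0;
        while (i < s.length)
        {
            if (cp % K == 0)
                samples ~= i;
            i += stride(s, i);
            ++cp;
        }
    }

    // Code unit offset of the n-th code point:
    // one table lookup plus at most K-1 stride computations.
    size_t codeUnitOffset(size_t n)
    {
        size_t i = samples[n / K];
        foreach (_; 0 .. n % K)
            i += stride(str, i);
        return i;
    }

    // The n-th code point itself.
    dchar opIndex(size_t n)
    {
        size_t i = codeUnitOffset(n);
        return decode(str, i);
    }
}

K controls the trade-off between table size and lookup time; the real rank/select structures achieve O(1) lookups at a higher constant space cost, which is where the overhead figures above come from.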
June 28, 2012
"David Nadlinger", in the message (digitalmars.D:170875), wrote:
> On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
>>> char[] is not treated as an array by the library, and is not treated as a RandomAccessRange. That is a second inconsistency, and it would be avoided if string were a struct.
>>
>> So, it looked to me like you were saying that making string a struct
>> would make it so that it was a random access range, which would mean
>> implementing length, opSlice, and opIndex.
> 
> I think he meant that the problem would be solved because people would be less likely to expect it to be a random access range in the first place.

Yes.

June 28, 2012
On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
> Pedantically speaking, it is possible to index a string with about 50-51% memory overhead to get random access in O(1) time. Best-performing algorithms can do random access in about 35-50 nanoseconds per operation for strings up to tens of megabytes. For bigger strings (tested up to 1GB) or when some other memory-intensive calculations are performed simultaneously, random access takes up to 200 nanoseconds due to the memory-access resolution process.

This would support both random access to characters by their code point index in a string, and determining the code point index that corresponds to a given code unit index.

If only the former is needed, the space overhead decreases to 25% for 1K strings and below 15% for 16K-1G strings (sizes measured in code units; for wstring the size in bytes is twice that). Strings up to 2^64 code units would be supported.

This would also improve access speed significantly (by about 10% for small strings and by roughly a factor of two for large ones).

June 28, 2012
On 6/28/12 5:28 AM, Christophe Travert wrote:
> As a more general comment, I think having a consistent language is a very
> important goal to achieve when designing a language. It makes everything
> simpler, from language design to the user, through compiler and library
> development. It may not be too late for D.

In a way it's too late for any language in actual use. The "fog of language design" makes it nigh impossible to design a language/library combo that is perfectly consistent, not to mention the fact that consistency itself has many dimensions, some of which may be in competition.

We'd probably do things a bit differently if we started from scratch. As things are, D's strings have a couple of quirks but are very apt for good and efficient string manipulation, where index computation in the code unit realm is combined with range-based traversal in the code point realm. I suppose people who have an understanding of UTF don't have difficulty using D's strings. Above all, alea jacta est and there's little we can do about that, save for inventing a time machine.


Andrei

June 28, 2012
On 6/28/12 5:58 AM, Roman D. Boiko wrote:
> Pedantically speaking, it is possible to index a string with about
> 50-51% memory overhead to get random access in O(1) time.
> Best-performing algorithms can do random access in about 35-50
> nanoseconds per operation for strings up to tens of megabytes. For
> bigger strings (tested up to 1GB) or when some other memory-intensive
> calculations are performed simultaneously, random access takes up to 200
> nanoseconds due to the memory-access resolution process.

Pedantically speaking, sheer timings say nothing without the appropriate baselines.

Andrei
June 28, 2012
On Thursday, 28 June 2012 at 12:29:14 UTC, Andrei Alexandrescu wrote:
> On 6/28/12 5:58 AM, Roman D. Boiko wrote:
> Pedantically speaking, sheer timings say nothing without the appropriate baselines.
>
> Andrei

I used the benchmark results for the two such algorithms that I like most, taken from here:

Vigna, S. (2008). "Broadword implementation of rank/select queries". Experimental Algorithms: 154–168.

http://en.wikipedia.org/wiki/Succinct_data_structure#cite_ref-vigna2008broadword_6-0

The numbers should be representative of C/C++ code executed on hardware that already existed back in 2008. I'm not sure there is a good baseline to compare against. One option would be to benchmark random access to code points in a UTF-32 string. I also don't know of any D implementations of these algorithms, so I cannot predict how they would fare against dstring random access.

But your statement that these timings say nothing is not fair: they can be used to conclude that this speed should be enough for most practical use cases, especially when those use cases are known.
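
To give an idea of the broadword flavour of these algorithms (just an illustration, not the exact structure from the paper): counting code points in a block of UTF-8 data reduces to counting the bytes that are not continuation bytes, which can be done eight code units at a time with a couple of bit operations and a popcount:

import core.bitop : popcnt;

// Number of code points starting within one 64-bit word of UTF-8 data
// (eight code units). A continuation byte has the form 10xxxxxx.
size_t codePointsInWord(ulong w)
{
    enum ulong highBits = 0x8080808080808080UL;
    ulong top  = w & highBits;          // bit 7 of every byte
    ulong next = (w << 1) & highBits;   // bit 6 of every byte, moved up to bit 7
    ulong continuation = top & ~next;   // marks exactly the 10xxxxxx bytes
    return 8 - popcnt(continuation);
}

A rank structure then stores precomputed counts for larger blocks, so a query boils down to a couple of table lookups plus one or two such word counts.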
June 28, 2012
On 06/28/2012 11:28 AM, Christophe Travert wrote:
> Jonathan M Davis, in the message (digitalmars.D:170872), wrote:
>> On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
>>> "Jonathan M Davis", in the message (digitalmars.D:170852), wrote:
>>>> completely consistent with regards to how it treats strings. The _only_
>>>> inconsistencies are between the language and the library - namely how
>>>> foreach iterates on code units by default and the fact that while the
>>>> language defines length, slicing, and random-access operations for
>>>> strings, the library effectively does not consider strings to have them.
>>
>>> char[] is not treated as an array by the library
>>
>> Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and
>> char[] works with the functions in std.array. It's just that they're all
>> special-cased appropriately to handle narrow strings properly. What it doesn't
>> do is treat char[] as a range of char.
>>
>>> and is not treated as a RandomAccessRange.
>
> All arrays are treated as RandomAccessRanges, except for char[] and
> wchar[]. So I think I am entitled to say that strings are not treated as
> arrays.

"Not treated like other arrays" is the strongest claim that can be
made there.

> And I would say I am also entitled to say strings are not normal
> ranges, since they define length, but have isLength as true,

hasLength as false. They define length, but it is not part of the range
interface.

It is analogous to the following:

class NarrowString : ForwardRange!dchar {
    /* interface ForwardRange!dchar */
    dchar front();
    bool empty();
    void popFront();
    NarrowString save();

    /* other methods, not part of the range interface */
    size_t length();                           // length in code units
    char opIndex(size_t i);                    // code unit access
    NarrowString opSlice(size_t a, size_t b);  // slice by code unit indices
}

> and define opIndex and opSlice,

[] and [..] operate on code units, but for a random access range as
defined by Phobos, they would not.
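
E.g. (just a quick illustration):

import std.range;

unittest
{
    string s = "é";             // one code point, two code units
    static assert(is(typeof(s[0])    == immutable(char))); // [] yields a code unit
    static assert(is(typeof(s.front) == dchar));           // front yields a code point
    assert(s.length == 2);      // length counts code units
    assert(s.walkLength == 1);  // walking the range counts code points
}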

> but are not RandomAccessRanges.
>
> The fact that isDynamicArray!(char[]) is true, but
> isRandomAccessRange is not is just another aspect of the schizophrenia.
> The behavior of a templated function on a string will depend on which
> was used as a guard.
>

No, it won't.

>>
>> Which is what I already said.
>>
>>> That is a second inconsistency, and it would be avoided if string were a struct.
>>
>> No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing
>> for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a
>> struct to represent them is irrelevant. And if you can't do those operations
>> in O(1), then they can't be random access ranges.
>
> I never said strings should support length and slicing. I even said
> they should not. foreach is inconsistent with the way strings are
> treated in Phobos, but opIndex, opSlice and length are inconsistent too.
> string[0] and string.front do not even return the same....
>
> Please read my post a little bit more carefully before
> answering it.
>

This is very impolite.

On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
> Slicing is provided for convenience, but not as opSlice, since it is not O(1), but
> as a method with a separate name.


> About the rest of your post, I basically say the same as you in shorter
> terms, except that I am in favor of changing things (but I didn't even
> say they should be changed in my conclusion).
>

When read carefully, the conclusion says that code compatibility is
important only a couple sentences before it says that breaking code for
the fun of it may be a good thing.

> newcomers are troubled by this problem, and I think it is important.

Newcomers sometimes become seasoned D programmers. Sometimes they know
what Unicode is about even before that.

> They will make mistakes when using both array and range functions on
> strings in the same algorithm, or when using array functions without
> knowing about UTF-8 encoding issues (the fact that array functions would
> also be valid range functions, if it were not for strings, does not help).
> But I also think experienced programmers can be affected, because of
> inattention, reusing code written by inexperienced programmers, or
> inappropriate use of template guards.

In the ASCII-7 subset, UTF-8 strings are actually random access, and
iterating a UTF-8 string by code point is safe if you are, e.g., just
going to treat some ASCII characters specially.
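
E.g., a sketch: counting fields separated by an ASCII comma only needs
the comma itself to be recognized, so no further Unicode machinery is
required:

size_t countFields(string s)
{
    size_t fields = 1;
    foreach (dchar c; s)    // decoded code points
        if (c == ',')
            ++fields;
    return fields;
}

(And since no code unit of a multi-byte UTF-8 sequence is ever below
0x80, iterating by code unit with foreach (char c; s) would give the
same answer here.)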

I don't care much whether or not (bad?) code handles Unicode correctly,
but it is important that code correctly documents whether or not it
does so, and to what extent it does. The new std.regex has good Unicode
support, and to enable that, it had to add some pretty large tables to
Phobos, the functionality of which is not exposed to the library user
as of now. It is therefore safe to say that many/most existing D
programs do not handle the whole Unicode standard correctly.

Unicode has to be _actively_ supported. There are distinct issues that
are hard to abstract away efficiently. Treating a Unicode string as a
range of code points does not solve them. (dchar[] indexing is still not
guaranteed to give back the 'i'th character!) Why build this
interpretation into the language?
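
For instance (illustrating the parenthesised point):

unittest
{
    dchar[] s = "e\u0301"d.dup; // 'e' followed by a combining acute accent
    assert(s.length == 2);      // two code points for one visible character
    assert(s[1] == '\u0301');   // s[1] is a combining mark, not the "2nd character"
}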

>
> As a more general comment, I think having a consistent language is a very
> important goal to achieve when designing a language. It makes everything
> simpler, from language design to the user, through compiler and library
> development. It may not be too late for D.
>

The language is consistent here. The library treats some language
features specially. It is not the language that is "confusing". The
whole reason to introduce the library behaviour is probably based on
similar reasoning as given in your post. The special casing has not
caused me any trouble, and sometimes it was useful.
June 28, 2012
Timings should not be very different from random access in any UTF-32 string implementation, because of the design of these algorithms:

* only operations on 64-bit aligned words are performed (addition, multiplication, bitwise and shift operations)

* there is no branching except at the very top level for very large array sizes

* data is stored in a way that makes the algorithms cache-oblivious, IIRC. The authors claim that very few cache misses are necessary (1-2 per random access).

* after determining the code unit index for a given code point index, further access is performed as usual inside an array; to perform slicing, it is only necessary to compute the code unit indices of the slice's start and end (see the sketch after this list).

* the original data arrays are not modified (unlike with compact representations of dstring, for example).
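
A sketch of the last two points, assuming a hypothetical index type that provides codeUnitOffset (code point index to code unit index):

// Slicing by code point indices: two index lookups plus an ordinary
// array slice; the original string data is never touched.
string sliceByCodePoints(Index)(string s, ref Index idx, size_t a, size_t b)
{
    return s[idx.codeUnitOffset(a) .. idx.codeUnitOffset(b)];
}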
June 28, 2012
Timon Gehr, in the message (digitalmars.D:170884), wrote:
>> And I would say I am also entitled to say strings are not normal ranges, since they define length, but have isLength as true,
> 
> hasLength as false.
Of course, my mistake.

> They define length, but it is not part of the range interface.
> 
> It is analogous to the following:
> [...]

I consider this bad design.

>> and define opIndex and opSlice,
> 
> [] and [..] operate on code units, but for a random access range as defined by Phobos, they would not.

A bidirectional range of dchar with additional methods of a random access range of char. That is what I call schizophrenic.

>> but are not RandomAccessRanges.
>>
>> The fact that isDynamicArray!(char[]) is true, but
>> isRandomAccessRange is not is just another aspect of the schizophrenia.
>> The behavior of a templated function on a string will depend on which
>> was used as a guard.
>>
> No, it won't.

Take the isRandomAccessRange specialization of an algorithm in Phobos, replace the guard by isDynamicArray, and you are very likely to change the behavior, if you do not simply break the function.

> When read carefully, the conclusion says that code compatibility is important only a couple sentences before it says that breaking code for the fun of it may be a good thing.

It was intended as a side-note, not a conclusion. Sorry for not being clear.

>> newcomers are troubled by this problem, and I think it is important.
> 
> Newcomers sometimes become seasoned D programmers. Sometimes they know what Unicode is about even before that.

I knew what Unicode was before coming to D. But, strings being arrays, I expected myString.front to return the same thing as myString[0], i.e. a char, and that it was my job to make sure my algorithms were valid for UTF-8 encoding if I wanted to support it. Most of the time, in languages without such UTF-8 support, they are, without much trouble. Code units matter more than code points most of the time.

> The language is consistent here. The library treats some language features specially. It is not the language that is "confusing". The whole reason to introduce the library behaviour is probably based on similar reasoning as given in your post.

OK, I should have said the standard library is inconsistent (with the language).

> The special casing has not caused me any trouble, and sometimes it was useful.

Of course, I can live with that.