June 27, 2012
Re: standard ranges
On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
> On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
>>> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch>
>>> wrote:
>>>
>>>> There is no reason for anyone to be confused about this endlessly. It
>>>> is simple to understand. Furthermore, think about the implications of a
>>>> library-defined string type: it just introduces the problem of what the
>>>> type of built-in string literals should be. This would cause endless
>>>> pain with type deduction, ifti, string mixins, ... A library-defined
>>>> string type cannot be a full string type. Pretending that it can has no
>>>> value.
>>>
>>> Default type of the literal should be the library type.
>>
>> Then it is not a library type, but a built-in type. Are you planning to
>> inject a dependency on Phobos into the compiler?
>
> No, druntime, and include minimal utf support. We do the same thing with
> AssociativeArray.
>

In this case it is misleading to call it a library type.

>>> If you want immutable(char)[], use "abc".codeunits or equivalent.
>>>
>>
>> I really don't want to type .codeunits, but I want to use
>> immutable(char)[] everywhere. This 'library type' is just an interface
>> change that makes writing nice and efficient code a kludge.
>
> When most string functions take strings, why would you want to use
> immutable(char)[] everywhere?
>

Because the proposed 'string' interface is inconvenient to use and 
useless. It is a struct with one data member and no additionally
maintained invariant, and it strictly narrows the essential parts of
the interface to the data that is reachable without a large typing
overhead. immutable(char)[] supports exactly the operations I usually
need. Maybe I'm not representative.

>>> Of course, it should by default work as a zero-terminated char * for C
>>> compatibility.
>>>
>>> The current situation is not simple to understand.
>>
>> It is simple, even if not immediately obvious. It does not have to be
>> immediately obvious without explanation. It needs to be convenient.
>
> Try sorting an array of ascii characters.
>

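import std.algorithm : sort;
// .dup first: the literal is immutable, so it must not be sorted in place.
auto asciitext = cast(ubyte[])"I am ascii text".dup;
sort(asciitext);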
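
For illustration, the reason a cast (or something like it) is needed can be checked directly. This is only a sketch, assuming a reasonably recent Phobos in which std.string.representation exposes the code units of a char[] as ubyte[]; the variable names are made up for the example.

```d
import std.algorithm : isSorted, sort;
import std.string : representation;

void main()
{
    // Copy the literal first: string literals are immutable.
    char[] text = "I am ascii text".dup;

    // As far as Phobos is concerned, char[] is a range of dchar without
    // O(1) indexing, so sort's random-access constraint rejects it.
    static assert(!__traits(compiles, sort(text)));

    // Viewing the same memory as ubyte[] gives a plain random-access range.
    sort(text.representation);

    assert(text.representation.isSorted);
    assert(text[0] == ' ' && text[$ - 1] == 'x');
}
```

The sort happens in place on the code units, which is only meaningful here because the text is known to be ASCII.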


>>> Generic code that accepts arrays has to special-case narrow-width
>>> strings if you plan to
>>> use phobos with them in some cases. That is a horrible situation.
>>>
>>
>> Generic code accepts ranges, not arrays. All necessary (or maybe
>> unnecessary, I don't know) special casing is already done for you in
>> Phobos. The _only_ thing that is problematic is the inconsistent
>> 'foreach' behaviour.
>
> Plenty of generic code specializes on arrays.
>

Ok, point taken. But plenty of generic code then specializes on
strings as well. Would the net gain be so huge? There is also always
the option of just not passing strings to some helper template function
you defined.

There are multiple valid contradictory considerations on the topic, but
I have found the current way of dealing with strings very pleasant.

>>>> alias immutable(char)[] string is just fine.
>>>
>>> That is technically fine, but if phobos wants to treat immutable(char)[]
>>> as something other than an array, it is not fine.
>>>
>>> -Steve
>>
>> Phobos does not treat immutable(char)[] as something other than an
>> array. It does not treat all arrays uniformly though.
>
> It certainly does. An array by definition is a random-access range. It
> does not treat strings as random access ranges.
>
> -Steve

You are right about the random-access part, but the definition of an
array does not depend on the 'range' concept.
June 28, 2012
Re: standard ranges
On Wednesday, June 27, 2012 23:41:14 Timon Gehr wrote:
> On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
> > When most string functions take strings, why would you want to use
> > immutable(char)[] everywhere?
> 
> Because the proposed 'string' interface is inconvenient to use and
> useless. It is a struct with one data member and no additionally
> maintained invariant, and it strictly narrows the essential parts of
> the interface to the data that is reachable without a large typing
> overhead. immutable(char)[] supports exactly the operations I usually
> need. Maybe I'm not representative.

I think that a lot of programmers want to be able to use strings without 
worrying about any of the details (like unicode). The fact that foreach and 
the library don't treat strings the same is confusing, and the fact that 
narrow strings are ranges of dchar (with all that that implies with regards to 
the operations that they support) seems to confuse a lot of people. If we had 
a struct for a string type, then the usage would be consistent (always a range 
of dchar), allowing the average programmer to more or less ignore unicode 
considerations as long as they don't care about efficiency, but it would still 
allow those who _do_ care to get at the underlying representation. So, a 
struct would be an improvement in that regard.

But for those who know what they're doing with regards to unicode and 
understand the fact that foreach treats strings one way and the library treats 
them another way, it really isn't a problem. It works quite well (which is one 
of the reasons that Walter isn't too keen on changing strings). It just isn't 
terribly newbie-friendly.
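
To make the foreach-versus-library difference concrete, here is a small sketch (assuming current Phobos behaviour; the string and counters are only illustrative):

```d
import std.range : walkLength;
import std.range.primitives : ElementType;

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    size_t units;
    foreach (c; s)          // c is inferred as char: iterates code units
        ++units;

    size_t points;
    foreach (dchar c; s)    // explicit dchar: the language decodes code points
        ++points;

    assert(units == 6);
    assert(points == 5);
    assert(s.length == 6);      // the array length counts code units
    assert(s.walkLength == 5);  // the range view counts code points

    // The library always sees a narrow string as a range of dchar.
    static assert(is(ElementType!string == dchar));
}
```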

- Jonathan M Davis
June 28, 2012
Re: standard ranges
On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer
<schveiguy@yahoo.com> wrote:
> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> There is no reason for anyone to be confused about this endlessly. It
>> is simple to understand. Furthermore, think about the implications of a
>> library-defined string type: it just introduces the problem of what the
>> type of built-in string literals should be. This would cause endless
>> pain with type deduction, ifti, string mixins, ... A library-defined
>> string type cannot be a full string type. Pretending that it can has no
>> value.
>
>
> Default type of the literal should be the library type.  If you want
> immutable(char)[], use "abc".codeunits or equivalent.
>
> Of course, it should by default work as a zero-terminated char * for C
> compatibility.
>
> The current situation is not simple to understand.  Generic code that
> accepts arrays has to special-case narrow-width strings if you plan to use
> phobos with them in some cases.  That is a horrible situation.
>
>
>> alias immutable(char)[] string is just fine.
>
>
> That is technically fine, but if phobos wants to treat immutable(char)[] as
> something other than an array, it is not fine.
>
> -Steve

Currently strings below dstring are only applicable in ForwardRange
and below, but not RandomAccessRange as they should be.

-- 
Bye,
Gor Gyolchanyan.
June 28, 2012
Re: standard ranges
On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
> On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer
> 
> <schveiguy@yahoo.com> wrote:
> > On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
> >> There is no reason for anyone to be confused about this endlessly. It
> >> is simple to understand. Furthermore, think about the implications of a
> >> library-defined string type: it just introduces the problem of what the
> >> type of built-in string literals should be. This would cause endless
> >> pain with type deduction, ifti, string mixins, ... A library-defined
> >> string type cannot be a full string type. Pretending that it can has no
> >> value.
> > 
> > Default type of the literal should be the library type.  If you want
> > immutable(char)[], use "abc".codeunits or equivalent.
> > 
> > Of course, it should by default work as a zero-terminated char * for C
> > compatibility.
> > 
> > The current situation is not simple to understand.  Generic code that
> > accepts arrays has to special-case narrow-width strings if you plan to use
> > phobos with them in some cases.  That is a horrible situation.
> > 
> >> alias immutable(char)[] string is just fine.
> > 
> > That is technically fine, but if phobos wants to treat immutable(char)[]
> > as something other than an array, it is not fine.
> > 
> > -Steve
> 
> Currently strings below dstring are only applicable in ForwardRange
> and below, but not RandomAccessRange as they should be.

Except that they shouldn't be, because you can't do random access on a narrow 
string in O(1). If you can't index or slice a range in O(1), it has no 
business having those operations. The same goes for length. That's why narrow 
strings do not have any of those operations as far as ranges are concerned. 
Having those operations in anything worse than O(1) violates the algorithmic 
complexity guarantees that ranges are supposed to provide, which would 
seriously harm the efficiency of algorithms which rely on them. It's the same 
reason why std.container defines the algorithmic complexity of all the 
operations in std.container. If you want a random-access range which is a 
string type, you need dchar[], const(dchar)[], or dstring. That is very much 
on purpose and would not change even if strings were structs.
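
As a rough illustration of the above, the range traits can be queried directly. This sketch assumes current Phobos; nthCodePoint is a hypothetical helper written for the example, not a library function.

```d
import std.range : drop;
import std.range.primitives : front, hasLength, hasSlicing,
    isBidirectionalRange, isRandomAccessRange;

// Narrow strings: bidirectional ranges of dchar, with no O(1) primitives.
static assert( isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!wstring);

// UTF-32 strings: one code unit per code point, so random access is fine.
static assert(isRandomAccessRange!dstring);
static assert(hasLength!dstring);

// Hypothetical helper: reaching the n-th code point of a narrow string
// means decoding everything before it, i.e. O(n) rather than O(1).
dchar nthCodePoint(string s, size_t n)
{
    return s.drop(n).front;
}

void main()
{
    assert(nthCodePoint("h\u00E9llo", 2) == 'l');
    // Array index 2 is the second code unit of U+00E9, not the letter 'l'.
    assert("h\u00E9llo"[2] != 'l');
}
```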

- Jonathan M Davis
June 28, 2012
Re: standard ranges
"Jonathan M Davis" wrote in message (digitalmars.D:170852):
> completely consistent with regards to how it treats strings. The _only_ 
> inconsistencies are between the language and the library - namely how foreach 
> iterates on code units by default and the fact that while the language defines 
> length, slicing, and random-access operations for strings, the library 
> effectively does not consider strings to have them.

char[] is not treated as an array by the library, and is not treated as 
a RandomAccessRange. That is a second inconsistency, and it would be 
avoided if string were a struct.

I won't repeat arguments that were already said, but if it matters, to 
me, things should be such that:

- string is a druntime defined struct, with an underlying 
immutable(char)[]. It is a BidirectionalRange of dchar. Slicing is 
provided for convenience, but not as opSlice, since it is not O(1), but 
as a method with a separate name. Direct access to the underlying 
char[]/ubyte[] is provided.

- similar structs are provided to hold underlying const(char)[] and 
char[]

- similar structs are provided for wstring

- dstring is a druntime defined alias to dchar[] or a struct with the 
same functionalities for consistency with narrow string being struct.

- All those structs may be provided as a template.
struct string(T = immutable(char)) {...}
alias string!(immutable(wchar)) wstring;
alias string!(immutable(dchar)) dstring;

string!(const(char)) and string!(char) ... are the other types of 
strings.

- this string template could also be defined as a wrapper to convert 
any range of char/wchar into a range of dchar. That does not need to be 
in druntime. Only types necessary for string literals should be in 
druntime.

- string should not be convertible to char*. Use toStringz to interface 
with C code, or the underlying char[] if you know it is 
zero-terminated, at your own risk. Only string literals need to be 
convertible to char*, and I would say that they should be 
zero-terminated only when they are directly used as char*, to allow the 
compiler to optimize memory.

- char /may/ disappear in favor of ubyte (or the contrary, or one could 
alias the other), if there is no other need to keep separate types than 
having strings that are different from ubyte[]. Only dchar is necessary, 
and it could just be called char.
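
A minimal sketch of what such a wrapper might look like, purely for illustration. The name StringOf and the codeunits property are hypothetical (StringOf is used to avoid clashing with the existing string alias), and the sketch simply forwards to the decoding range primitives that Phobos already provides for arrays of char and wchar.

```d
import std.range.primitives : back, empty, front, popBack, popFront,
    ElementType, isBidirectionalRange, isRandomAccessRange;

// Hypothetical wrapper: hides indexing, slicing and length, and exposes only
// a bidirectional range of dchar plus explicit access to the code units.
struct StringOf(T = immutable(char))
{
    private T[] data;

    @property bool empty() { return data.empty; }
    @property dchar front() { return data.front; } // decodes one code point
    @property dchar back() { return data.back; }
    void popFront() { data.popFront(); }
    void popBack() { data.popBack(); }
    @property StringOf save() { return this; }

    // Explicit escape hatch to the underlying array.
    @property T[] codeunits() { return data; }
}

void main()
{
    alias Str = StringOf!();
    static assert(isBidirectionalRange!Str);
    static assert(!isRandomAccessRange!Str);
    static assert(is(ElementType!Str == dchar));

    auto s = Str("h\u00E9llo");
    assert(s.front == 'h');
    s.popFront();
    assert(s.front == '\u00E9');
    assert(s.codeunits.length == 5); // 2 units for U+00E9 plus 3 for "llo"

    // No indexing through the wrapper; go through codeunits instead.
    static assert(!__traits(compiles, s[0]));
}
```

Literal typing, C interop and the const/mutable variants from the list are left out; the sketch only covers the range interface.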

That is ideal to me. Of course, I understand code compatibility is 
important, and compromises have to be made. The current situation is a 
compromise, but I don't like it because it is a WAT for every newcomer. 
But the last point, for example, would bring nothing more than code breakage. 
Such code breakage may make us find bugs however...

-- 
Christophe
June 28, 2012
Re: standard ranges
On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
> "Jonathan M Davis" wrote in message (digitalmars.D:170852):
> > completely consistent with regards to how it treats strings. The _only_
> > inconsistencies are between the language and the library - namely how
> > foreach iterates on code units by default and the fact that while the
> > language defines length, slicing, and random-access operations for
> > strings, the library effectively does not consider strings to have them.

> char[] is not treated as an array by the library

Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and 
char[] works with the functions in std.array. It's just that they're all 
special-cased appropriately to handle narrow strings properly. What it doesn't 
do is treat char[] as a range of char.
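
For reference, this split can be seen in the traits themselves. A small sketch, assuming current Phobos:

```d
import std.array : array;
import std.range.primitives : ElementEncodingType, ElementType;
import std.traits : isDynamicArray, isNarrowString;

// As far as std.traits is concerned, char[] is an ordinary dynamic array...
static assert(isDynamicArray!(char[]));
static assert(isNarrowString!(char[]));

// ...but the range primitives present it as a range of dchar, not of char.
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementEncodingType!(char[]) == char));

void main()
{
    char[] s = "h\u00E9".dup;

    // std.array.array is one of the special-cased functions: for a narrow
    // string it decodes and returns an array of dchar.
    auto decoded = s.array;
    static assert(is(typeof(decoded) == dchar[]));
    assert(decoded.length == 2 && s.length == 3);
}
```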

> and is not treated as a RandomAccessRange.

Which is what I already said.

> That is a second inconsistency, and it would be avoided if string were a
> struct.

No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing 
for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a 
struct to represent them is irrelevant. And if you can't do those operations 
in O(1), then they can't be random access ranges.

The _only_ thing that using a struct for narrow strings fixes is the 
inconsistencies with foreach (it would then use dchar just like all of the 
range stuff does), and slicing, indexing, and length wouldn't be on it, 
eliminating the oddity of them existing but not considered to exist by range-
based functions. It _would_ make things somewhat nicer for newbies, but it 
would not give you one iota more of functionality. Narrow strings would still 
be bidirectional ranges but not random-access ranges, and you would still have to 
operate on the underlying array to operate on strings efficiently.

If we were to start from scratch, it probably would be better to go with a 
struct type for strings, but it would break far too much code for far too 
little benefit at this point. You need to understand the unicode stuff 
regardless - like the difference between code units and code points. So, if 
anything, the fact that strings are treated inconsistently and are treated as 
ranges of dchar - which confuses so many newbies - is arguably a _good_ thing 
in that it forces newbies to realize and understand the unicode issues 
involved rather than blindly using strings in a horribly inefficient manner as 
would inevitably occur with a struct string type.

So, no, the situation is not exactly ideal, and yes, a struct string type 
might have been a better solution, but I think that many of the folks who are 
pushing for a struct string type are seriously overestimating the problems 
that it would solve. Yes, it would make the language and library more 
consistent, but that's it. You'd still have to use strings in essentially the 
same way that you do now. It's just that you wouldn't have to explicitly use 
dchar with foreach, and you'd have to get at the property which returned the 
underlying array in order to operate on the code units as you need to do in 
many functions to make your code appropriately efficient rather than simply 
using the string that way directly by not using its range-based functions. 
There is a difference, but it's a lot smaller than many people seem to think.

- Jonathan M Davis
June 28, 2012
Re: standard ranges
Jonathan M Davis wrote in message (digitalmars.D:170872):
> On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
>> "Jonathan M Davis" wrote in message (digitalmars.D:170852):
>> > completely consistent with regards to how it treats strings. The _only_
>> > inconsistencies are between the language and the library - namely how
>> > foreach iterates on code units by default and the fact that while the
>> > language defines length, slicing, and random-access operations for
>> > strings, the library effectively does not consider strings to have them.
> 
>> char[] is not treated as an array by the library
> 
> Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and 
> char[] works with the functions in std.array. It's just that they're all 
> special-cased appropriately to handle narrow strings properly. What it doesn't 
> do is treat char[] as a range of char.
> 
>> and is not treated as a RandomAccessRange.

All arrays are treated as RandomAccessRanges, except for char[] and 
wchar[]. So I think I am entitled to say that strings are not treated as 
arrays. And I would say I am also entitled to say strings are not normal 
ranges, since they define length but have hasLength as false, and define 
opIndex and opSlice but are not RandomAccessRanges.

The fact that isDynamicArray!(char[]) is true, but 
isRandomAccessRange is not is just another aspect of the schizophrenia. 
The behavior of a templated function on a string will depend on which 
was used as a guard.
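
A small sketch of the guard problem, assuming current Phobos. Both helpers are hypothetical and differ only in their template constraint:

```d
import std.range : walkLength;
import std.range.primitives : isInputRange;
import std.traits : isDynamicArray;

// Hypothetical helper guarded on "is an array": sees code units.
size_t countViaArrayGuard(T)(T a) if (isDynamicArray!T)
{
    return a.length;
}

// Hypothetical helper guarded on "is a range": sees code points
// when handed a narrow string.
size_t countViaRangeGuard(R)(R r) if (isInputRange!R)
{
    return r.walkLength;
}

void main()
{
    // For ordinary arrays the two agree.
    assert(countViaArrayGuard([1, 2, 3]) == countViaRangeGuard([1, 2, 3]));

    // For a narrow string with non-ASCII content they do not.
    assert(countViaArrayGuard("h\u00E9llo") == 6);
    assert(countViaRangeGuard("h\u00E9llo") == 5);
}
```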

> 
> Which is what I already said.
> 
>> That is a second inconsistency, and it would be avoided if string were a
>> struct.
> 
> No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing 
> for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a 
> struct to represent them is irrelevant. And if you can't do those operations 
> in O(1), then they can't be random access ranges.

I never said strings should support length and slicing. I even said 
they should not. foreach is inconsistent with the way strings are 
treated in phobos, but opIndex, opSlice and length are inconsistent too. 
string[0] and string.front do not even return the same....
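
A minimal example of that last point, assuming current Phobos:

```d
import std.range.primitives : front;

void main()
{
    string s = "\u00E9"; // U+00E9, encoded in UTF-8 as 0xC3 0xA9

    // Indexing is the array view: one code unit.
    static assert(is(typeof(s[0]) == immutable(char)));
    // front is the range view: one decoded code point.
    static assert(is(typeof(s.front) == dchar));

    assert(s[0] == 0xC3);
    assert(s.front == '\u00E9');
    assert(s.length == 2);
}
```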

Please read my post a little bit more carefully before 
answering them.

About the rest of your post, I basically say the same as you in shorter 
terms, except that I am in favor of changing things (but I didn't even 
say they should be changed in my conclusion).

Newcomers are troubled by this problem, and I think it is important. 
They will make mistakes when using both array and range functions on 
strings in the same algorithm, or when using array functions without 
knowing about UTF-8 encoding issues (the fact that array functions are 
also valid range functions if not for strings does not help). But I also 
think experienced programmers can be affected, because of inattention, 
reusing code written by inexperienced programmers, or inappropriate 
template guard usage.

As a more general comment, I think having a consistent language is a very 
important goal to achieve when designing a language. It makes everything 
simpler, from language design to user through compiler and library 
development. It may not be too late for D.

-- 
Christophe
June 28, 2012
Re: standard ranges
On Thursday, June 28, 2012 09:28:52 Christophe Travert wrote:
> I never said strings should support length and slicing. I even said
> they should not. foreach is inconsistent with the way strings are
> treated in phobos, but opIndex, opSlice and length are inconsistent too.
> string[0] and string.front do not even return the same....
> 
> Please read my post a little bit more carefully before
> answering them.

You said this:

> char[] is not treated as an array by the library, and is not treated as 
> a RandomAccessRange. That is a second inconsistency, and it would be 
> avoided if string were a struct.

So, it looked to me like you were saying that making string a struct would 
make it so that it was a random access range, which would mean implementing 
length, opSlice, and opIndex.

- Jonathan M Davis
June 28, 2012
Re: standard ranges
On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
>> char[] is not treated as an array by the library, and is not 
>> treated as a RandomAccessRange. That is a second 
>> inconsistency, and it would be avoided if string were a struct.
>
> So, it looked to me like you were saying that making string a 
> struct would
> make it so that it was a random access range, which would mean 
> implementing
> length, opSlice, and opIndex.

I think he meant that the problem would be solved because people 
would be less likely to expect it to be a random access range in 
the first place.

What troubles me most with having is(string == immutable(char)[]) 
is that it more or less precludes us from adding small string 
optimizations, etc. in the future…

David
June 28, 2012
Re: standard ranges
On Thursday, 28 June 2012 at 05:10:43 UTC, Jonathan M Davis wrote:
> On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
>> Currently strings below dstring are only applicable in 
>> ForwardRange
>> and below, but not RandomAccessRange as they should be.
>
> Except that they shouldn't be, because you can't do random 
> access on a narrow
> string in O(1). If you can't index or slice a range in O(1), it 
> has no
> business having those operations. The same goes for length. 
> That's why narrow
> strings do not have any of those operations as far as ranges 
> are concerned.
> Having those operations in anything worse than O(1) violates 
> the algorithmic
> complexity guarantees that ranges are supposed to provide, 
> which would
> seriously harm the efficiency of algorithms which rely on them. 
> It's the same
> reason why std.container defines the algorithmic complexity of 
> all the
> operations in std.container. If you want a random-access range 
> which is a
> string type, you need dchar[], const(dchar)[], or dstring. That 
> is very much
> on purpose and would not change even if strings were structs.
>
> - Jonathan M Davis

Pedantically speaking, it is possible to index a string with 
about 50-51% memory overhead to get random access in O(1) time. 
Best-performing algorithms can do random access in about 35-50 
nanoseconds per operation for strings up to tens of megabytes. 
For bigger strings (tested up to 1GB) or when some other 
memory-intensive calculations are performed simultaneously, 
random access takes up to 200 nanoseconds due to memory-access 
resolution process.