September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On 9/21/11 1:20 PM, Christophe Travert wrote:
> Dealing with UTF-encoded strings is less efficient, but there are a number
> of algorithms that can be optimized for UTF-encoded strings, like copying
> or finding an ASCII char in a string. Unfortunately, there is no
> practical way to do this with the current range API.

I'd love to hear more about that. The standard library does optimize 
certain algorithms for UTF strings.

Andrei
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
Andrei Alexandrescu, in message (digitalmars.D:144936), wrote:
> On 9/21/11 1:20 PM, Christophe Travert wrote:
>> Dealing with UTF-encoded strings is less efficient, but there are a number
>> of algorithms that can be optimized for UTF-encoded strings, like copying
>> or finding an ASCII char in a string. Unfortunately, there is no
>> practical way to do this with the current range API.
> 
> I'd love to hear more about that. The standard library does optimize 
> certain algorithms for UTF strings.


Well, in that other thread called "Re: toUTFz and WinAPI 
GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to 
a message here?), I showed how to improve walkLength for strings, and 
utf.stride.
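
The referenced D.learn thread is not reproduced here, so the following is not the implementation shown there. It is only a minimal sketch of one common way to count code points without decoding: in valid UTF-8, every code point contains exactly one byte that is not a continuation byte (`10xxxxxx`). The name `countCodePoints` is hypothetical, and the input is assumed to be valid UTF-8.

```d
import std.range : walkLength;

// Hypothetical helper, not part of Phobos: counts code points by counting
// the code units that are not UTF-8 continuation bytes (0b10xxxxxx).
// Assumes the input is valid UTF-8; nothing is decoded.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (char unit; s)
    {
        if ((unit & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

void main()
{
    string s = "héllo";
    assert(s.length == 6);                        // 6 code units
    assert(countCodePoints(s) == 5);              // 5 code points
    assert(countCodePoints(s) == s.walkLength);   // agrees with the decoding walk
}
```

Whether this is faster than the library's `walkLength` on a given compiler would have to be measured; the sketch only shows that the count can be obtained from code units alone.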

About finding a character in a string: rather than relying 
on string.popFront, which prevents the loop from being unrolled, 
we could search code unit by code unit directly. This is obviously 
better for ASCII chars, and I'll be looking for a nice idea for other 
code points (besides using find(Range, Range)).
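
A minimal sketch of the code-unit search described above, assuming a UTF-8 haystack and an ASCII needle. It relies on the fact that bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, so a byte-wise match is always a real match. The name `indexOfAscii` is hypothetical and is not a Phobos function.

```d
// Hypothetical helper: index (in code units) of the first occurrence of an
// ASCII character, or -1 if it is absent. No decoding and no popFront.
ptrdiff_t indexOfAscii(const(char)[] s, char c)
{
    assert(c < 0x80, "needle must be ASCII");
    foreach (i, char unit; s)   // iterates raw code units
    {
        if (unit == c)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // 'é' occupies two code units, so ',' sits at code-unit index 6.
    assert(indexOfAscii("héllo, wörld", ',') == 6);
    assert(indexOfAscii("héllo", 'z') == -1);
}
```

The returned index is a code-unit offset, so it can be used directly to slice the original string.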

I didn't review Phobos with that idea in mind, and didn't run any 
benchmark except the one for walkLength, but using string.popFront is a 
bad idea in terms of performance, so workarounds are often better, and 
they are not that hard to find. I may do that when I have more time to 
give to D.

-- 
Christophe
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On Wednesday, September 21, 2011 19:56:47 Christophe wrote:
> "Jonathan M Davis", in message (digitalmars.D:144922), wrote:
> > 1. drop says nothing about slicing.
> > 2. popFrontN (which drop calls) says that it slices for ranges that
> > support slicing. Strings do not unless they're arrays of dchar.
> > 
> > Yes, hasSlicing should probably be clearer about narrow strings, but
> > that has nothing to do with drop.
> 
> I never said there was a problem with drop.

Yes you did. You said:

"mini-quiz: what should std.range.drop(some_string, 1) do ?
hint: what it actually does is not what the documentation of phobos 
suggests*..."

> After having read all of you, I have no problem with string being a
> lazy range of dchar. But I have a problem with immutable(char)[] being
> a lazy range of dchar (i.e. not being an array), and I have a problem with
> string being immutable(char)[] (i.e. providing length, opIndex and
> opSlice).

For efficiency, you need to be able to treat strings as arrays of code units for 
some algorithms. For correctness, you need to be able to treat them as ranges 
of code points (dchar) in the general case. You need both. The question is how 
to provide that. Strings as arrays came first (D1), whereas ranges came later. 
We _need_ to treat strings as ranges of dchar or they're essentially unusable 
in the general case. Operating on code units is almost always _wrong_. So, 
when we added the range functions, we special-cased them for strings so that 
strings are treated as ranges of dchar as they need to be. And in cases where 
you actually need to treat a string as an array of code units for efficiency, 
you special case the function for them, and you still get that. What other way 
would you do it?
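
A short sketch of the two views described above, using only `std.range` functions (`front`, `walkLength`, `drop`). It assumes a Phobos version in which narrow strings are decoded by the range primitives, as the post describes.

```d
import std.range; // front, walkLength, drop

void main()
{
    string s = "éa";                 // 'é' is two UTF-8 code units, 'a' is one

    assert(s.length == 3);           // array view: code units
    assert(s.walkLength == 2);       // range view: code points
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'é');          // front decodes a whole code point
    assert(s[0] == 0xC3);            // indexing yields a single code unit

    assert(s.drop(1) == "a");        // drop pops one code point, not one code unit
    assert(s[1 .. $].length == 2);   // slicing at index 1 would cut 'é' in half
}
```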

There _are_ some rough edges here - such as foreach defaulting to char for string 
when dchar is really what you should be iterating with - and there are times 
when you want to use a string with a range-based function and can't, because 
it needs a random-access range or one which is sliceable to do what it does, 
which can be annoying. But what else can you do there? You can't treat the 
string as a range of code units in that case. The result would be completely 
wrong. Imagine if sort worked on a char[]. You'd get an array of sorted code 
units, which would _not_ be code points, and which would be completely 
useless. So, treating a string as a range of code units makes no sense.
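
The edges mentioned above can be sketched as follows. This is only an illustration, assuming a Phobos version where narrow strings are not random-access ranges; converting to `dchar[]` before sorting is one possible workaround, not a recommendation from the thread.

```d
import std.algorithm : sort;
import std.conv : to;
import std.range : hasSlicing, isRandomAccessRange;

void main()
{
    string s = "héllo";

    size_t units, points;
    foreach (c; s)        // c is inferred as char: iterates code units
        ++units;
    foreach (dchar c; s)  // asking for dchar makes foreach decode code points
        ++points;
    assert(units == 6 && points == 5);

    // As ranges, narrow strings are neither random-access nor sliceable,
    // so sort rejects char[] at compile time.
    static assert(!isRandomAccessRange!(char[]));
    static assert(!hasSlicing!string);
    static assert(!__traits(compiles, sort("héllo".dup)));

    // Sorting code points works on an array of dchar.
    dchar[] cps = s.to!dstring.dup;
    sort(cps);
    assert(cps == "hlloé"d);
}
```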

We could switch to having a struct of some kind which was a string, make it a 
range of dchar, and have it contain an array of char, wchar, or dchar 
internally. It would have to restrict its operations in exactly the same 
manner that the range functions for strings currently do, so the exact same 
algorithms would or wouldn't work with it. And then you'd need to provide 
access to the underlying array of code units so that algorithms special casing 
strings could operate on the array instead. Ultimately, it's pretty much the 
same thing, except now you have a wrapper struct. How does that buy you 
anything? The _only_ thing that it would buy you AFAIK is that foreach would 
then default to dchar instead of the code unit type. The basic problem still 
exists. You still need to special case strings for efficiency, and you still 
need to treat them as a range of dchar in the general case. It's an inherent 
issue with variable length encodings. You can't just magically make it go 
away.

If you have a better solution, please share it, but the fact that we want both 
efficiency and correctness binds us pretty thoroughly here.

- Jonathan M Davis
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On 9/21/11 3:26 PM, Christophe Travert wrote:
> Andrei Alexandrescu, in message (digitalmars.D:144936), wrote:
>> On 9/21/11 1:20 PM, Christophe Travert wrote:
>>> Dealing with UTF-encoded strings is less efficient, but there are a number
>>> of algorithms that can be optimized for UTF-encoded strings, like copying
>>> or finding an ASCII char in a string. Unfortunately, there is no
>>> practical way to do this with the current range API.
>>
>> I'd love to hear more about that. The standard library does optimize
>> certain algorithms for UTF strings.
>
>
> Well, in that other thread called "Re: toUTFz and WinAPI
> GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to
> a message here?), I showed how to improve walkLength for strings, and
> utf.stride.

Interesting, thanks.

> About finding a character in a string: rather than relying
> on string.popFront, which prevents the loop from being unrolled,
> we could search code unit by code unit directly. This is obviously
> better for ASCII chars, and I'll be looking for a nice idea for other
> code points (besides using find(Range, Range)).
>
> I didn't review Phobos with that idea in mind, and didn't run any
> benchmark except the one for walkLength, but using string.popFront is a
> bad idea in terms of performance, so workarounds are often better, and
> they are not that hard to find. I may do that when I have more time to
> give to D.

That sounds great. Looking forward to your pull requests!

Andrei
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
Jonathan M Davis, in message (digitalmars.D:144944), wrote:
>> I never said there was a problem with drop.
> 
> Yes you did. You said:
> 
> "mini-quiz: what should std.range.drop(some_string, 1) do ?
> hint: what it actually does is not what the documentation of phobos 
                                             ^^^^^^^^^^^^^^^^^^^^^^^
> suggests*..."
not documentation of drop.

> If you have a better solution, please share it, but the fact that we want both 
> efficiency and correctness binds us pretty thoroughly here.

- char[], etc. being real arrays.
- strings being lazy ranges of dchar, providing access to underlying 
char[].

The correctness of the language is better, since we don't have a T[] with a 
front method that returns something other than T, or a type that accepts 
opSlice but is not sliceable, etc.

Runtime correctness and efficiency are the same as the current ones, 
since the whole of Phobos already treats strings as lazy ranges of dchar. 
It is even better, since the user cannot change an arbitrary code point 
in a string without explicitly asking for the underlying char[]. 
Optimizations can come the same way as they currently can, since the 
underlying char[] is accessible.
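
A minimal sketch of what such a wrapper might look like, under stated assumptions: the type name `String`, its field `units_` and its property `codeUnits` are hypothetical and exist in neither the language nor Phobos. Only `std.utf.decode`, `std.utf.stride` and the `std.range` traits are real library symbols.

```d
import std.range : ElementType, isForwardRange;
import std.utf : decode, stride;

// Hypothetical type illustrating the proposal; not an existing Phobos type.
struct String
{
    private string units_;   // underlying UTF-8 code units

    this(string s) { units_ = s; }

    // Range primitives iterate code points, decoding lazily.
    @property bool empty() { return units_.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(units_, i);
    }

    void popFront() { units_ = units_[stride(units_, 0) .. $]; }

    @property String save() { return this; }

    // Explicit access to the code units, for algorithms that special-case them.
    @property string codeUnits() { return units_; }
}

static assert(isForwardRange!String);
static assert(is(ElementType!String == dchar));

void main()
{
    auto s = String("héllo");

    size_t points;
    foreach (c; s)   // c is dchar here, because front returns dchar
    {
        static assert(is(typeof(c) == dchar));
        ++points;
    }
    assert(points == 5);
    assert(s.codeUnits.length == 6);

    // The wrapper itself offers no indexing and no length.
    static assert(!__traits(compiles, s[0]));
    static assert(!__traits(compiles, s.length));
}
```

In this sketch an optimized algorithm would still have to go through `codeUnits`, which is the same special-casing that the library does today for built-in strings.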

I can deal with strings the way they are, since they are a legacy. 
They are not perfect, and never will be unless computers become powerful 
enough to treat dchar[] just as efficiently as char[]. I am also aware 
that Phobos cannot be optimized for every case in the first place, and 
I can change my mind.

-- 
Christophe
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On Wednesday, September 21, 2011 14:08 Christophe wrote:
> Jonathan M Davis, in message (digitalmars.D:144944), wrote:
> >> I never said there was a problem with drop.
> > 
> > Yes you did. You said:
> > 
> > "mini-quiz: what should std.range.drop(some_string, 1) do ?
> > hint: what it actually does is not what the documentation of phobos
> 
> ^^^^^^^^^^^^^^^^^^^^^^^
> 
> > suggests*..."
> 
> not documentation of drop.

You weren't specific enough to make it clear what you meant. It looked like
you were complaining about drop's documentation.

> > If you have a better solution, please share it, but the fact that we want
> > both efficiency and correctness binds us pretty thoroughly here.
> 
> - char[], etc. being real arrays.

Which is actually arguably a _bad_ thing, since it doesn't generally make 
sense to operate on individual chars. What you really want 99.99999999999% of 
the time is code points not code units.

> - strings being lazy ranges of dchar, providing access to underlying
> char[].
> 
> The correctness of the language is better, since we don't have a T[] with a
> front method that returns something other than T, or a type that accepts
> opSlice but is not sliceable, etc.
> 
> Runtime correctness and efficiency are the same as the current ones,
> since the whole of Phobos already treats strings as lazy ranges of dchar.
> It is even better, since the user cannot change an arbitrary code point
> in a string without explicitly asking for the underlying char[].
> Optimizations can come the same way as they currently can, since the
> underlying char[] is accessible.
> 
> I can deal with strings the way they are, since they are a legacy.
> They are not perfect, and never will be unless computers become powerful
> enough to treat dchar[] just as efficiently as char[]. I am also aware
> that Phobos cannot be optimized for every case in the first place, and
> I can change my mind.

So, essentially you're arguing for a wrapper around arrays of code units. That 
does add some benefits (such as making foreach default to dchar), but 
ultimately doesn't add that much additional benefit (it also makes dealing 
with array literals much more interesting). If we were to start over again, 
that may very well be the way that we'd go, but the added benefits just don't 
outweigh the immense amount of code breakage which would result. Maybe the 
situation will change with D3, but at this point, I think that we've done a 
fairly good job of making it possible to treat strings as ranges.

- Jonathan M Davis
September 21, 2011
Re: D on hackernews
On 21.09.2011 at 16:24, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 9/21/11 8:52 AM, Timon Gehr wrote:
>> On 09/21/2011 09:37 AM, bearophile wrote:
>>> Andrei Alexandrescu:
>>>
>>>> http://hackerne.ws/item?id=3014861
>>>>
>>>> Apparently we're still having a PR issue.
>>>
>>> I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
>>> its space to D (meaning D2).
>>>
>>
>> Yes, that is important. Wikipedia is usually the first place people go
>> looking for information, and much of the information given there is
>> horribly outdated/wrong and mostly only concerns.
>
> Agreed. Does anyone volunteer for fixing D's Wikipedia page?
>
> Andrei

Is anyone uninvolved enough to be objective and involved enough to know  
what they write?
Timon, I think you are exaggerating a bit. It is not mostly only concerns,  
but I agree they have bold headers, and other language pages like Java or  
C++ lack this section entirely. Now it would certainly make people  
suspicious if the section disappeared overnight. I would reduce the font  
weight of the headers in that section, remove the talk about UTF-8 string  
handling, and at some point move the 'library split' issue to a  
historical section about D1. I think the focus on x86 is also a valid  
concern.

Here are some statistics I collected, for fun:

The Catalan version also says the following:
- It is unstable and unsuitable for production environments (version 0.140)

The Catalan and Galician versions say:
- The only documentation is the official specification

The following language versions are direct translations of (or the source for)  
the English version. Maybe their authors can be contacted and are willing  
to update their language version after the change:
- Arabic
- Spanish
- Polish (not the exact same sections, probably translated from an older  
version)

The Italian version has the most impressive features list with 14 sections  
about characteristics!

The Latin version is blowing my mind, just because people who use a long  
dead language would write code in a new one that has to do with computers:

import tango.io.Console;

int main(char[][] args) {
    Cout("salve munde!");
    return 0;
}

Many language pages are hopelessly outdated, but speakers of 'minority'  
languages will look for an English article anyway.
September 21, 2011
Re: D on hackernews
On 09/22/2011 01:17 AM, Marco Leise wrote:
> On 21.09.2011 at 16:24, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 9/21/11 8:52 AM, Timon Gehr wrote:
>>> On 09/21/2011 09:37 AM, bearophile wrote:
>>>> Andrei Alexandrescu:
>>>>
>>>>> http://hackerne.ws/item?id=3014861
>>>>>
>>>>> Apparently we're still having a PR issue.
>>>>
>>>> I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
>>>> its space to D (meaning D2).
>>>>
>>>
>>> Yes, that is important. Wikipedia is usually the first place people go
>>> looking for information, and much of the information given there is
>>> horribly outdated/wrong and mostly only concerns.
>>
>> Agreed. Does anyone volunteer for fixing D's Wikipedia page?
>>
>> Andrei
>
> Is anyone uninvolved enough to be objective and involved enough to know
> what they write?
> Timon, I think you are exaggerating a bit. It is not mostly only
> concerns, but I agree they have bold headers, and other language pages
> like Java or C++ lack this section entirely.

On 09/21/2011 04:33 PM, Timon Gehr wrote:
>> Yes, that is important. Wikipedia is usually the first place people go
>> looking for information, and much of the information given there is
>> horribly outdated/wrong and mostly only concerns.
>
> ... concerns [D1].

That was just a bad place to accidentally leave out a word. ;)
I agree that the article also contains some useful information.
September 22, 2011
Re: D's confusing strings (was Re: D on hackernews)
"Jonathan M Davis", in message (digitalmars.D:144962), wrote:
>> - char[], etc. being real arrays.
> 
> Which is actually arguably a _bad_ thing, since it doesn't generally make 
> sense to operate on individual chars. What you really want 99.99999999999% of 
> the time is code points not code units.

Well, char could also disappear in favor of ubyte, but that would confuse 
even more people.

> If we were to start over again, that may very well be the way that 
> we'd go, but the added benefits just don't outweigh the immense amount 
> of code breakage which would result.

I 100% agree with that.

-- 
Christophe