D's confusing strings (was Re: D on hackernews) (page 3) - D Programming Language Discussion Forum

September 21, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Andrei Alexandrescu
in reply to Christophe Travert

On 9/21/11 1:20 PM, Christophe Travert wrote:
> Dealing with UTF-encoded strings is less efficient, but there are a number
> of algorithms that can be optimized for UTF-encoded strings, like copying
> or finding an ASCII char in a string. Unfortunately, there is no
> practical way to do this with the current range API.

I'd love to hear more about that. The standard library does optimize certain algorithms for UTF strings.

Andrei

September 21, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Christophe Travert
in reply to Andrei Alexandrescu

Andrei Alexandrescu, in message (digitalmars.D:144936), wrote:
> On 9/21/11 1:20 PM, Christophe Travert wrote:
>> Dealing with UTF-encoded strings is less efficient, but there are a number of algorithms that can be optimized for UTF-encoded strings, like copying or finding an ASCII char in a string. Unfortunately, there is no practical way to do this with the current range API.
> 
> I'd love to hear more about that. The standard library does optimize certain algorithms for UTF strings.

Well, in that other thread called "Re: toUTFz and WinAPI GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to a message here ?), I showed how to improve walkLength for strings and utf.stride.
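
The walkLength improvement itself was posted in the other thread and is not reproduced here. As a rough sketch of the general idea only, and assuming the input is already valid UTF-8, code points can be counted without decoding by counting every code unit that is not a continuation byte (bit pattern `10xxxxxx`). The name `countCodePoints` is hypothetical and this is not necessarily the approach shown in D.learn.

```d
/// Hypothetical helper: counts code points in valid UTF-8 without decoding.
/// Assumption: `s` contains well-formed UTF-8; malformed input is not detected.
size_t countCodePoints(const(char)[] s) pure nothrow @nogc @safe
{
    size_t n = 0;
    foreach (char u; s)                  // iterate raw code units, no decoding
        if ((u & 0xC0) != 0x80) ++n;     // skip continuation bytes (10xxxxxx)
    return n;
}
```

Because the loop body touches one code unit at a time and never calls popFront, it is the kind of loop a compiler may be able to unroll or vectorize.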

About finding a character in a string: rather than relying on string.popFront, which makes the loop un-unrollable, we could search code unit by code unit directly. This is obviously better for ASCII chars, and I'll be looking for a nice idea for other code points (besides using find(Range, Range)).
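
A minimal sketch of the ASCII case, assuming a UTF-8 haystack. It relies on the fact that in UTF-8 every byte of a multi-byte sequence is 0x80 or above, so a code unit below 0x80 can only be a complete ASCII character. The name `indexOfAscii` is hypothetical, not a Phobos function.

```d
/// Hypothetical helper: code-unit index of the first `needle` in `haystack`,
/// or -1 if absent. Only valid for ASCII needles.
ptrdiff_t indexOfAscii(const(char)[] haystack, char needle) pure nothrow @nogc @safe
{
    assert(needle < 0x80, "needle must be ASCII");
    foreach (i, char u; haystack)        // plain code-unit scan, no decoding
        if (u == needle) return cast(ptrdiff_t) i;
    return -1;
}
```

The returned index is a code-unit offset, so it can be used directly to slice the original string.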

I didn't review Phobos with that idea in mind, and didn't run any benchmark except the one for walkLength, but using string.popFront is a bad idea in terms of performance, so workarounds are often better, and they are not that hard to find. I may do that when I have more time to give to D.

-- 
Christophe

September 21, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Jonathan M Davis
in reply to Christophe

On Wednesday, September 21, 2011 19:56:47 Christophe wrote:
> "Jonathan M Davis", in message (digitalmars.D:144922), wrote:
> > 1. drop says nothing about slicing.
> > 2. popFrontN (which drop calls) says that it slices for ranges that
> > support slicing. Strings do not unless they're arrays of dchar.
> > 
> > Yes, hasSlicing should probably be clearer about narrow strings, but that has nothing to do with drop.
> 
> I never said there was a problem with drop.

Yes you did. You said:

"mini-quiz: what should std.range.drop(some_string, 1) do ?
hint: what it actually does is not what the documentation of phobos
suggests*..."
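
For readers following the quiz, here is a small sketch of what the call does, assuming a Phobos version in which narrow strings are auto-decoded by the range primitives (the behaviour discussed in this thread). The wrapper name `dropFirst` is hypothetical.

```d
import std.range : drop;

/// Hypothetical wrapper around the call from the quiz.
string dropFirst(string s)
{
    // Removes one code point, which may span several code units.
    return s.drop(1);
}
```

With the input "éa" (three code units), the result is "a": both code units of 'é' are removed, which is different from slicing with `s[1 .. $]`.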

> After having read all of you, I have no problems with string being a
> lazy range of dchar. But I have a problem with immutable(char)[] being
> lazy range of dchar (ie not being a array), and I have a problem with
> string being immutable(char)[] (ie providing length opIndex and
> opSlice).

For efficiency, you need to be able to treat strings as arrays of code units for some algorithms. For correctness, you need to be able to treat them as ranges of code points (dchar) in the general case. You need both. The question is how to provide that. strings as arrays came first (D1), whereas ranges came later. We _need_ to treat strings as ranges of dchar or they're essentially unusable in the general case. Operating on code units is almost always _wrong_. So, when we added the range functions, we special-cased them for strings so that strings are treated as ranges of dchar as they need to be. And in cases where you actually need to treat a string as an array of code units for efficiency, you special case the function for them, and you still get that. What other way would you do it?
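
The two views can be put side by side in a short sketch, assuming auto-decoding Phobos range primitives. `Counts` and `measure` are hypothetical names used only for illustration.

```d
import std.range : walkLength;

/// Hypothetical pair of measurements for one string.
struct Counts
{
    size_t codeUnits;   // array view: what .length reports
    size_t codePoints;  // range view: what walking the range reports
}

Counts measure(string s)
{
    return Counts(s.length, s.walkLength);
}
```

For "héllo" the array view reports 6 code units while the range view reports 5 code points; for pure ASCII text the two agree.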

There _are_ some edges here - such as foreach defaulting to char for string when dchar is really what you should be iterating with - and there are times when you want to use a string with a range-based function and can't, because it needs a random-access range or one which is sliceable to do what it does, which can be annoying. But what else can you do there? You can't treat the string as a range of code units in that case. The result would be completely wrong. Imagine if sort worked on a char[]. You'd get an array of sorted code units, which would _not_ be code points, and which would be completely useless. So, treating a string as a range of code units makes no sense.
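
Both edges can be observed directly. The sketch below assumes auto-decoding Phobos traits; the function names are hypothetical.

```d
import std.range : isRandomAccessRange;

// Narrow strings are not random-access ranges, so algorithms that require
// random access (such as sorting) do not accept them as-is.
static assert(!isRandomAccessRange!(char[]));
static assert( isRandomAccessRange!(dchar[]));

/// Hypothetical helper: iterations performed by foreach with an inferred type.
size_t iterationsDefault(string s)
{
    size_t n;
    foreach (c; s) ++n;          // c is inferred as char: one step per code unit
    return n;
}

/// Same loop with the element type spelled out as dchar.
size_t iterationsDchar(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;    // decodes: one step per code point
    return n;
}
```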

We could switch to having a struct of some kind which was a string, make it a range of dchar, and have it contain an array of char, wchar, or dchar internally. It would have to restrict its operations in exactly the same manner that the range functions for strings currently do, so the exact same algorithms would or wouldn't work with it. And then you'd need to provide access to the underlying array of code units so that algorithms special casing strings could operate on the array instead. Ultimately, it's pretty much the same thing, except now you have a wrapper struct. How does that buy you anything? The _only_ thing that it would buy you AFAIK is that foreach would then default to dchar instead of the code unit type. The basic problem still exists. You still need to special case strings for efficiency, and you still need to treat them as a range of dchar in the general case. It's an inherent issue with variable length encodings. You can't just magically make it go away.
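
A minimal sketch of such a wrapper, for UTF-8 only, might look like the following. The type `Utf8String` and its `codeUnits` property are hypothetical; this illustrates the shape of the idea, not a proposed Phobos design.

```d
import std.range : isInputRange;
import std.utf : decode, stride;

/// Hypothetical wrapper: a range of dchar over UTF-8 storage.
struct Utf8String
{
    private string data;

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);              // decode the first code point
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $];   // skip one whole code point
    }

    /// Explicit access to the code units for specialised algorithms.
    @property string codeUnits() { return data; }
}

static assert(isInputRange!Utf8String);
```

An algorithm that wants the fast path still has to ask for `codeUnits` and special-case it, which is the same special-casing the range functions do for strings today.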

If you have a better solution, please share it, but the fact that we want both efficiency and correctness binds us pretty thoroughly here.

- Jonathan M Davis

September 21, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Andrei Alexandrescu
in reply to Christophe Travert

On 9/21/11 3:26 PM, Christophe Travert wrote:
> Andrei Alexandrescu, in message (digitalmars.D:144936), wrote:
>> On 9/21/11 1:20 PM, Christophe Travert wrote:
>>> Dealing with UTF-encoded strings is less efficient, but there are a number
>>> of algorithms that can be optimized for UTF-encoded strings, like copying
>>> or finding an ASCII char in a string. Unfortunately, there is no
>>> practical way to do this with the current range API.
>>
>> I'd love to hear more about that. The standard library does optimize
>> certain algorithms for UTF strings.
>
>
> Well, in that other thread called "Re: toUTFz and WinAPI
> GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to
> a message here ?), I showed how to improve walkLength for strings and
> utf.stride.

Interesting, thanks.

> About finding a character in a string: rather than relying
> on string.popFront, which makes the loop un-unrollable,
> we could search code unit by code unit directly. This is obviously
> better for ASCII chars, and I'll be looking for a nice idea for other
> code points (besides using find(Range, Range)).
>
> I didn't review Phobos with that idea in mind, and didn't run any
> benchmark except the one for walkLength, but using string.popFront is a
> bad idea in terms of performance, so workarounds are often better, and
> they are not that hard to find. I may do that when I have more time to
> give to D.

That sounds great. Looking forward to your pull requests!

Andrei

September 21, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Christophe
in reply to Jonathan M Davis

Jonathan M Davis, in message (digitalmars.D:144944), wrote:
>> I never said there was a problem with drop.
> 
> Yes you did. You said:
> 
> "mini-quiz: what should std.range.drop(some_string, 1) do ?
> hint: what it actually does is not what the documentation of phobos
                                              ^^^^^^^^^^^^^^^^^^^^^^^
> suggests*..."
I meant the documentation of Phobos, not the documentation of drop.

> If you have a better solution, please share it, but the fact that we want both efficiency and correctness binds us pretty thoroughly here.

- char[], etc. being real arrays.
- strings being lazy ranges of dchar, providing access to underlying
char[].

Correctness of the language is better, since we don't have a T[] having a front method that returns something other than T, or a type that accepts opSlice but is not sliceable, etc.
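
The inconsistencies mentioned here can be checked at compile time. This is a sketch assuming auto-decoding Phobos traits; `sample` and `frontMatchesElement` are hypothetical names.

```d
import std.range : ElementType, hasLength, hasSlicing;

char[] sample;   // only used inside typeof and __traits checks

// Indexing a char[] yields char...
static assert(is(typeof(sample[0]) == char));
// ...but the range view of the same type has dchar elements.
static assert(is(ElementType!(char[]) == dchar));

// The slice expression compiles...
static assert(__traits(compiles, sample[0 .. 0]));
// ...yet the range traits report neither slicing nor length.
static assert(!hasSlicing!(char[]));
static assert(!hasLength!(char[]));

/// Hypothetical trait: does T[] viewed as a range yield T?
enum bool frontMatchesElement(T) = is(ElementType!(T[]) == T);
```

`frontMatchesElement` holds for types such as int and dchar but not for char, which is the asymmetry the proposal above wants to remove.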

Runtime correctness and efficiency are the same as the current ones, since the whole of Phobos already considers strings as lazy ranges of dchar. It is even better, since the user cannot change an arbitrary code point in a string without explicitly asking for the underlying char[]. Optimizations can come the same way as they currently can, since the underlying char[] is accessible.

I can deal with strings the way they are, since they are a legacy. They are not perfect, and never will be unless computers become powerful enough to treat dchar[] just as efficiently as char[]. I am also aware that Phobos cannot be optimized for every case from the start, and I can change my mind.

-- 
Christophe

September 21, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Jonathan M Davis
in reply to Christophe

On Wednesday, September 21, 2011 14:08 Christophe wrote:
> Jonathan M Davis, in message (digitalmars.D:144944), wrote:
> >> I never said there was a problem with drop.
> > 
> > Yes you did. You said:
> > 
> > "mini-quiz: what should std.range.drop(some_string, 1) do ?
> > hint: what it actually does is not what the documentation of phobos
> >                                             ^^^^^^^^^^^^^^^^^^^^^^^
> > suggests*..."
> 
> not documentation of drop.

You weren't specific enough to make it clear what you meant. It looked like you were complaining about drop's documentation.

> > If you have a better solution, please share it, but the fact that we want both efficiency and correctness binds us pretty thoroughly here.
> 
> - char[], etc. being real arrays.

Which is actually arguably a _bad_ thing, since it doesn't generally make sense to operate on individual chars. What you really want 99.99999999999% of the time is code points not code units.

> - strings being lazy ranges of dchar, providing access to underlying char[].
> 
> Correctness of the language is better, since we don't have a T[] having a front method that returns something other than T, or a type that accepts opSlice but is not sliceable, etc.
> 
> Runtime correctness and efficiency are the same as the current ones, since the whole of Phobos already considers strings as lazy ranges of dchar. It is even better, since the user cannot change an arbitrary code point in a string without explicitly asking for the underlying char[]. Optimizations can come the same way as they currently can, since the underlying char[] is accessible.
> 
> I can deal with strings the way they are, since they are a legacy. They are not perfect, and never will be unless computers become powerful enough to treat dchar[] just as efficiently as char[]. I am also aware that Phobos cannot be optimized for every case from the start, and I can change my mind.

So, essentially you're arguing for a wrapper around arrays of code units. That does add some benefits (such as making foreach default to dchar), but ultimately doesn't add that much additional benefit (it also makes dealing with array literals much more interesting). If we were to start over again, that may very well be the way that we'd go, but the added benefits just don't outweigh the immense amount of code breakage which would result. Maybe the situation will change with D3, but at this point, I think that we've done a fairly good job of making it possible to treat strings as ranges.

- Jonathan M Davis

September 21, 2011

Re: D on hackernews

Posted by Marco Leise
in reply to Andrei Alexandrescu

On 21.09.2011 at 16:24, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 9/21/11 8:52 AM, Timon Gehr wrote:
>> On 09/21/2011 09:37 AM, bearophile wrote:
>>> Andrei Alexandrescu:
>>>
>>>> http://hackerne.ws/item?id=3014861
>>>>
>>>> Apparently we're still having a PR issue.
>>>
>>> I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
>>> its space to D (meaning D2).
>>>
>>
>> Yes, that is important. Wikipedia is usually the first place people go
>> looking for information, and much of the information given there is
>> horribly outdated/wrong and mostly only concerns.
>
> Agreed. Does anyone volunteer for fixing D's Wikipedia page?
>
> Andrei

Is anyone uninvolved enough to be objective and involved enough to know what they write?
Timon, I think you are exaggerating a bit. It is not mostly only concerns, but I agree they have bold headers, and other language pages like Java or C++ lack this section entirely. Now it would certainly make people suspicious if the section disappeared overnight. I would reduce the font weight of the headers in that section, remove the talk about UTF-8 string handling, and at some point move the 'library split' issue to a historical section about D1. I think the focus on x86 is also a valid concern.

Here are some statistics I collected, for fun:

The Català version also says the following:
- It is unstable and unsuitable for production environments (version 0.140)

The Català and Galego version say:
- The only documentation is the official specification

The following language versions are direct translations of (or the source for) the English version. Maybe their authors can be contacted and are willing to update their language version after the change:
- Arabic
- Español
- Polski (not the exact same sections, probably translated from an older version)

The Italian version has the most impressive features list with 14 sections about characteristics!

The Latin version is blowing my mind, simply because people who use a long-dead language are writing code in a new language made for computers:

import tango.io.Console;

int main(char[][] args) {
    Cout("salve munde!");
    return 0;
}

Many language pages are hopelessly outdated, but speakers of 'minority' languages will look for an English article anyway.

September 21, 2011

Re: D on hackernews

Posted by Timon Gehr
in reply to Marco Leise

On 09/22/2011 01:17 AM, Marco Leise wrote:
> On 21.09.2011 at 16:24, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 9/21/11 8:52 AM, Timon Gehr wrote:
>>> On 09/21/2011 09:37 AM, bearophile wrote:
>>>> Andrei Alexandrescu:
>>>>
>>>>> http://hackerne.ws/item?id=3014861
>>>>>
>>>>> Apparently we're still having a PR issue.
>>>>
>>>> I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
>>>> its space to D (meaning D2).
>>>>
>>>
>>> Yes, that is important. Wikipedia is usually the first place people go
>>> looking for information, and much of the information given there is
>>> horribly outdated/wrong and mostly only concerns.
>>
>> Agreed. Does anyone volunteer for fixing D's Wikipedia page?
>>
>> Andrei
>
> Is anyone uninvolved enough to be objective and involved enough to know
> what they write?
> Timon, I think you are exaggerating a bit. It is not mostly only
> concerns, but I agree they have bold headers, and other language pages
> like Java or C++ lack this section entirely.

On 09/21/2011 04:33 PM, Timon Gehr wrote:
>> Yes, that is important. Wikipedia is usually the first place people go
>> looking for information, and much of the information given there is
>> horribly outdated/wrong and mostly only concerns.
>
> ... concerns [D1].

That was just a bad place to accidentally leave out a word. ;)
I agree that the article also contains some useful information.

September 22, 2011

Re: D's confusing strings (was Re: D on hackernews)

Posted by Christophe
in reply to Jonathan M Davis

"Jonathan M Davis", in message (digitalmars.D:144962), wrote:
>> - char[], etc. being real arrays.
> 
> Which is actually arguably a _bad_ thing, since it doesn't generally make sense to operate on individual chars. What you really want 99.99999999999% of the time is code points not code units.

Well, char could also disappear in favor of ubyte, but that would confuse even more people.

> If we were to start over again, that may very well be the way that we'd go, but the added benefits just don't outweigh the immense amount of code breakage which would result.

I 100% agree with that.

-- 
Christophe

Copyright © 1999-2021 by the D Language Foundation