September 28, 2014
On 9/28/2014 11:51 AM, bearophile wrote:
> Walter Bright:
>
>> but do want to stop adding more autodecoding functions like the proposed
>> std.path.withExtension().
>
> I am not sure that can work. Perhaps you need to create range2 and algorithm2
> modules, and keep adding some autodecoding functions to the old modules.

It can work just fine, and I wrote it. The problem is convincing someone to pull it :-( as the PR was closed and reopened with autodecoding put back in.

As I've explained many times, very few string algorithms actually need decoding at all. 'find', for example, does not. Trying to make a separate universe out of autodecoding algorithms is missing the point.

Certainly, setExtension() does not need autodecoding, and in fact all the autodecoding in it does is slow it down, allocate memory on errors, make it throwable, and produce dchar output, meaning at some point later you'll need to put it back to char.

I.e. there are no operations on paths that require decoding.
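To sketch the point (a hypothetical helper, not the actual std.path code): extension replacement can work purely on raw code units, because '.', '/' and '\\' are always single code units in UTF-8, so a byte-level scan can never split a multi-byte sequence.

```d
// Hypothetical withExt, illustrating decode-free path manipulation.
string withExt(string path, string ext)
{
    // Scan backwards over raw code units; no decoding anywhere.
    foreach_reverse (i, char c; path)
    {
        if (c == '.')
            return path[0 .. i] ~ ext;  // found an extension: replace it
        if (c == '/' || c == '\\')
            break;                      // hit a directory separator first
    }
    return path ~ ext;                  // no extension: append one
}

void main()
{
    assert(withExt("photo.jpeg", ".png") == "photo.png");
    assert(withExt("фото.jpeg", ".png") == "фото.png"); // non-ASCII name, still byte-level
    assert(withExt("dir.v2/readme", ".txt") == "dir.v2/readme.txt");
}
```

Nothing here can throw or allocate on bad UTF, and the output stays char, not dchar.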

I know that you care about performance - you post about it often. I would expect that unnecessary and pervasive decoding would be of concern to you.
September 28, 2014
28-Sep-2014 23:44, Uranuz writes:
>> I totally agree with all of that.
>>
>> It's one of those cases where correct by default is far too slow (that
>> would have to be graphemes) but fast by default is far too broken.
>> Better to force an explicit choice.
>>
>> There is no magic bullet for unicode in a systems language such as D.
>> The programmer must be aware of it and make choices about how to treat
>> it.
>
> I see; I didn't know about the difference between byCodeUnit and
> byGrapheme, because I speak Russian, which is close to English in
> that it has no diacritics. As far as I remember, German, which I
> learned at school, does have diacritics. So you opened my eyes on
> this question. My position as an ordinary programmer is that I
> speak a language whose graphemes are coded in 2 bytes

In UTF-16 and UTF-8.
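Concretely (the counts follow from the encodings themselves, not from anything Phobos-specific):

```d
import std.range : walkLength;

void main()
{
    string  s8  = "мир";   // UTF-8:  char code units
    wstring s16 = "мир"w;  // UTF-16: wchar code units
    dstring s32 = "мир"d;  // UTF-32: dchar code units

    assert(s8.length  == 6); // 3 Cyrillic letters × 2 one-byte code units each
    assert(s16.length == 3); // 1 two-byte code unit per letter
    assert(s32.length == 3); // 1 four-byte code unit per letter

    assert(s8.walkLength == 3); // autodecoded range view: 3 code points
}
```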

> and I always
> need to do decoding, otherwise my program will be broken. The
> other possibility is to use wstring or dstring, but that is less
> memory efficient. Also, UTF-8 is more commonly used on the
> Internet, so I don't want to do conversions to UTF-32, for example.
>
> Where could I read about byGrapheme?

std.uni docs:
http://dlang.org/phobos/std_uni.html#.byGrapheme
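A minimal example of what byGrapheme buys you over code units and code points, using a combining accent so the three counts all differ:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" with the accent as a separate combining code point (U+0301):
    string s = "noe\u0301l";

    assert(s.length == 6);                // 6 UTF-8 code units
    assert(s.walkLength == 5);            // 5 code points (autodecoded view)
    assert(s.byGrapheme.walkLength == 4); // 4 user-perceived characters
}
```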

> Isn't this approach
> overcomplicated? I don't want to write Dostoevskiy's book "War
> and Peace" just to write a parser for a simple DSL.

It's Tolstoy actually:
http://en.wikipedia.org/wiki/War_and_Peace

You don't need byGrapheme for a simple DSL. In fact, as long as the DSL is simple enough (ASCII only) you may safely avoid decoding. If it's in Russian you might want to decode. Even in this case there are ways to avoid decoding, though it may involve about as much writing as a typical short novel ;)

In fact, I did a couple of such literary exercises in the std library.

For codepoint lookups on non-decoded strings:
http://dlang.org/phobos/std_uni.html#.utfMatcher

And to create sets of codepoints to detect with matcher:
http://dlang.org/phobos/std_uni.html#.CodepointSet
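For instance, a rough sketch of building such sets (assuming the documented std.uni API: predefined property sets via `unicode`, and a CodepointSet constructor taking half-open code-point intervals):

```d
import std.uni : unicode, CodepointSet;

void main()
{
    // A set from a Unicode script property:
    auto cyr = unicode.Cyrillic;
    assert('д' in cyr);
    assert('d' !in cyr);

    // Or built by hand from half-open intervals [start, end):
    auto block = CodepointSet('\u0400', '\u0500'); // the basic Cyrillic block
    assert('я' in block);
}
```

Such a set can then be handed to utfMatcher to classify characters directly on the encoded bytes, without decoding first.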

-- 
Dmitry Olshansky
September 28, 2014
On 9/28/14, 11:36 AM, Walter Bright wrote:
> Currently, the autodecoding functions allocate with the GC and throw as
> well. (They'll GC allocate an exception and throw it if they encounter
> an invalid UTF sequence. The adapters use the more common method of
> inserting a substitution character and continuing on.) This makes it
> harder to make GC-free Phobos code.

The right solution here is refcounted exception plus policy-based functions in conjunction with RCString. I can't believe this focus has already been lost and we're back to let's remove autodecoding and ban exceptions. -- Andrei
September 28, 2014
Walter Bright:

> It can work just fine, and I wrote it. The problem is convincing someone to pull it :-( as the PR was closed and reopened with autodecoding put back in.

Perhaps you need range2 and algorithm2 modules. Introducing your changes in a sneaky way may not produce well-working, predictable user code.


> I know that you care about performance - you post about it often. I would expect that unnecessary and pervasive decoding would be of concern to you.

I care first of all about program correctness (that's why I proposed unusual things like optional strong typing for built-in array indexes, or the "enum preconditions"). Secondly, I care about performance in the functions or parts of code where performance is needed. There is plenty of code where performance is not the most important thing; that's why I have tons of range-based code. In such large parts of the code, having short, nice-looking code that is correct matters more. Please don't assume I am simple-minded :-)

Bye,
bearophile
September 28, 2014
On Sun, Sep 28, 2014 at 12:57:17PM -0700, Walter Bright via Digitalmars-d wrote:
> On 9/28/2014 11:51 AM, bearophile wrote:
> >Walter Bright:
> >
> >>but do want to stop adding more autodecoding functions like the
> >>proposed std.path.withExtension().
> >
> >I am not sure that can work. Perhaps you need to create range2 and algorithm2 modules, and keep adding some autodecoding functions to the old modules.
> 
> It can work just fine, and I wrote it. The problem is convincing someone to pull it :-( as the PR was closed and reopened with autodecoding put back in.

The problem with pulling such PRs is that they introduce a dichotomy into Phobos. Some functions autodecode, some don't, and from a user's POV, it's completely arbitrary and random. Which leads to bugs because people can't possibly remember exactly which functions autodecode and which don't.


> As I've explained many times, very few string algorithms actually need decoding at all. 'find', for example, does not. Trying to make a separate universe out of autodecoding algorithms is missing the point.
[...]

Maybe what we need to do, is to change the implementation of std.algorithm so that it internally uses byCodeUnit for narrow strings where appropriate. We're already specialcasing Phobos code for narrow strings anyway, so it wouldn't make things worse by making those special cases not autodecode.
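A sketch of what that opt-out looks like from user code (assuming std.utf.byCodeUnit as found in current Phobos):

```d
import std.algorithm : equal, find;
import std.utf : byCodeUnit;

void main()
{
    string s = "привет, мир";

    // Autodecoding path: find iterates over decoded dchars and can
    // throw on invalid UTF.
    assert(find(s, ',') == ", мир");

    // Explicit code-unit path: no decoding at all. Since ',' is a
    // single code unit, matching on raw chars is just as correct.
    auto r = find(s.byCodeUnit, ',');
    assert(equal(r, ", мир".byCodeUnit));
}
```

Doing the equivalent internally for narrow-string special cases would keep the public behavior while skipping the decode.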

This doesn't quite solve the issue of composing ranges, since a composed range that returns dchar from .front, when composed with another range, will have autodecoding built into it. For those cases, perhaps one way to hack around the present situation is to use Phobos-private enums in the wrapper ranges (e.g., enum isNarrowStringUnderneath=true; in struct Filter or something), that ranges downstream can test for, and do the appropriate bypasses.

(BTW, before you pick on specific algorithms you might want to actually
look at the code for things like find(), because I remember there were a
couple o' PRs where find() of narrow strings will use (presumably) fast
functions like strstr or strchr, bypassing a foreach loop over an
autodecoding .front.)


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
September 28, 2014
On 9/28/14, 1:39 PM, H. S. Teoh via Digitalmars-d wrote:
> On Sun, Sep 28, 2014 at 12:57:17PM -0700, Walter Bright via Digitalmars-d wrote:
>> On 9/28/2014 11:51 AM, bearophile wrote:
>>> Walter Bright:
>>>
>>>> but do want to stop adding more autodecoding functions like the
>>>> proposed std.path.withExtension().
>>>
>>> I am not sure that can work. Perhaps you need to create range2 and
>>> algorithm2 modules, and keep adding some autodecoding functions to
>>> the old modules.
>>
>> It can work just fine, and I wrote it. The problem is convincing
>> someone to pull it :-( as the PR was closed and reopened with
>> autodecoding put back in.
>
> The problem with pulling such PRs is that they introduce a dichotomy
> into Phobos. Some functions autodecode, some don't, and from a user's
> POV, it's completely arbitrary and random. Which leads to bugs because
> people can't possibly remember exactly which functions autodecode and
> which don't.

I agree. -- Andrei


September 28, 2014
> It's Tolstoy actually:
> http://en.wikipedia.org/wiki/War_and_Peace
>
> You don't need byGrapheme for simple DSL. In fact as long as DSL is simple enough (ASCII only) you may safely avoid decoding. If it's in Russian you might want to decode. Even in this case there are ways to avoid decoding, it may involve a bit of writing in as for typical short novel ;)

Yes, my mistake ;) I was thinking about *Crime and Punishment* but wrote *War and Peace*. Don't know why. Maybe because it is longer.

Thanks for the useful links. As far as we are talking about the standard library, I think some standard approach should be provided for common tasks: searching, sorting, parsing, splitting strings. I see that we currently have a lot of ways of doing similar things with strings. I think this is partly a documentation problem. When I parse text I can't understand why I need to use all of these range interfaces instead of just manipulating raw narrow strings. We have several modules for working with strings: std.range, std.algorithm, std.string, std.array, std.utf, and I can't see how they help me solve my problems. On the contrary, they just create a new problem: thinking about them in order to find the *right* way. So I spend most of my time thinking about that rather than solving my task.

It is hard for me to accept that we don't need to decode to do some operations. What is annoying is that I always need to think about the code-point length that I should show to the user versus the byte length that is used to slice the char array. It is very easy to confuse the two and do something wrong.

I see that it is all complicated: we have 3 types of character and more than 5 modules for trivial string manipulation, with tens of functions. It all goes to hell. And I haven't even started to do my job. We don't have a *standard* way to deal with this in the std lib; at least, that way is not documented well enough.
September 28, 2014
29-Sep-2014 00:39, H. S. Teoh via Digitalmars-d writes:
> On Sun, Sep 28, 2014 at 12:57:17PM -0700, Walter Bright via Digitalmars-d wrote:
>> On 9/28/2014 11:51 AM, bearophile wrote:
>>> Walter Bright:
>>>
>>>> but do want to stop adding more autodecoding functions like the
>>>> proposed std.path.withExtension().
>>>
>>> I am not sure that can work. Perhaps you need to create range2 and
>>> algorithm2 modules, and keep adding some autodecoding functions to
>>> the old modules.
>>
>> It can work just fine, and I wrote it. The problem is convincing
>> someone to pull it :-( as the PR was closed and reopened with
>> autodecoding put back in.
>
> The problem with pulling such PRs is that they introduce a dichotomy
> into Phobos. Some functions autodecode, some don't, and from a user's
> POV, it's completely arbitrary and random. Which leads to bugs because
> people can't possibly remember exactly which functions autodecode and
> which don't.
>

Agreed.

>
>> As I've explained many times, very few string algorithms actually need
>> decoding at all. 'find', for example, does not. Trying to make a
>> separate universe out of autodecoding algorithms is missing the point.
> [...]
>
> Maybe what we need to do, is to change the implementation of
> std.algorithm so that it internally uses byCodeUnit for narrow strings
> where appropriate. We're already specialcasing Phobos code for narrow
> strings anyway, so it wouldn't make things worse by making those special
> cases not autodecode.
>
> This doesn't quite solve the issue of composing ranges, since a
> composed range that returns dchar from .front, when composed with
> another range, will have autodecoding built into it. For those cases, perhaps one way to
> hack around the present situation is to use Phobos-private enums in the
> wrapper ranges (e.g., enum isNarrowStringUnderneath=true; in struct
> Filter or something), that ranges downstream can test for, and do the
> appropriate bypasses.
>

We need to either generalize the hack we did for char[] and wchar[] or start creating a whole new phobos without auto-decoding.

I'm not sure what's best but the latter is more disruptive.

> (BTW, before you pick on specific algorithms you might want to actually
> look at the code for things like find(), because I remember there were a
> couple o' PRs where find() of narrow strings will use (presumably) fast
> functions like strstr or strchr, bypassing a foreach loop over an
> autodecoding .front.)
>

Yes, it has a fast path.


-- 
Dmitry Olshansky
September 28, 2014
29-Sep-2014 00:33, Andrei Alexandrescu writes:
> On 9/28/14, 11:36 AM, Walter Bright wrote:
>> Currently, the autodecoding functions allocate with the GC and throw as
>> well. (They'll GC allocate an exception and throw it if they encounter
>> an invalid UTF sequence. The adapters use the more common method of
>> inserting a substitution character and continuing on.) This makes it
>> harder to make GC-free Phobos code.

> The right solution here is refcounted exception plus policy-based
> functions in conjunction with RCString. I can't believe this focus has
> already been lost and we're back to let's remove autodecoding and ban
> exceptions. -- Andrei

I've already stated my perception of the "no stinking exceptions" and "no destructors 'cause I want it fast" attitudes elsewhere.

Code must be correct and fast, with correct being a precondition for any performance tuning and speed hacks.

Correctness usually entails exceptions and automatic cleanup. I also do not believe the "exceptions have to be slow" motto; they are costly, but the proportion of such costs has been largely exaggerated.

-- 
Dmitry Olshansky
September 28, 2014
29-Sep-2014 00:44, Uranuz writes:
>> It's Tolstoy actually:
>> http://en.wikipedia.org/wiki/War_and_Peace
>>
>> You don't need byGrapheme for simple DSL. In fact as long as DSL is
>> simple enough (ASCII only) you may safely avoid decoding. If it's in
>> Russian you might want to decode. Even in this case there are ways to
>> avoid decoding, it may involve a bit of writing in as for typical
>> short novel ;)
>
> Yes, my mistake ;) I was thinking about *Crime and Punishment* but
> wrote *War and Peace*. Don't know why. Maybe because it is longer.
>

Admittedly both are way too long for my taste :)

> Thanks for the useful links. As far as we are talking about the standard
> library, I think some standard approach should be provided for common
> tasks: searching, sorting, parsing, splitting strings. I see that
> we currently have a lot of ways of doing similar things with strings. I
> think this is partly a documentation problem.

Some of this is historical; in particular, std.string is way older than std.algorithm.

> When I parse
> text I can't understand why I need to use all of these range interfaces
> instead of just manipulating raw narrow strings. We have several
> modules for working with strings: std.range, std.algorithm, std.string,
> std.array,

std.range publicly imports std.array thus I really do not see why we still have std.array as standalone module.

> std.utf, and I can't see how they help me solve my
> problems. On the contrary, they just create a new problem: thinking about
> them in order to find the *right* way.

There is no *right* way; every level of abstraction has its uses. There is also a bit of a trade-off between performance and easy/obvious/nice code.

> So I spend most of my time thinking
> about it rather than solving my task.

It takes time to get accustomed to a standard library. See also std.conv and std.format. String processing is indeed shotgunned across the entire Phobos.

> It is hard for me to accept that we don't need to decode to do some
> operations. What is annoying is that I always need to think about the
> code-point length that I should show to the user versus the byte length
> that is used to slice the char array. It is very easy to confuse the two
> and do something wrong.

As long as you use decoding primitives, you keep getting proper indices back automatically. That must be what some folks considered the correct way to do Unicode, until it became apparent to everybody that Unicode is way more than this.
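A small example of that property with std.utf.decode: the index it advances always lands on a code-point boundary, so it is always safe to slice with.

```d
import std.utf : decode;

void main()
{
    string s = "дом";
    size_t i = 0;

    // decode returns the code point and advances i past its code units.
    dchar c = decode(s, i);

    assert(c == 'д');
    assert(i == 2);            // 'д' occupies two UTF-8 code units
    assert(s[i .. $] == "ом"); // i is a valid slicing position
}
```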

>
> I see that it is all complicated: we have 3 types of character and more
> than 5 modules for trivial string manipulation, with tens of functions.
> It all goes to hell.

There are many tools, but when I write parsers I actually use almost none of them. Well, nowadays I'm going to use the stuff in std.uni like CodepointSet, utfMatcher, etc. std.regex makes some use of these already, but prior to that std.utf.decode was my lone workhorse.

> But I haven't even started to do my job. And we
> don't have a *standard* way to deal with this in the std lib. At least,
> that way is not documented enough.

Well, on the bright side, consider that C has lots of broken functions in its stdlib, and even some that are _never_ safe, like "gets" ;)

-- 
Dmitry Olshansky