September 28, 2014
On 9/28/2014 1:38 PM, bearophile wrote:
> Walter Bright:
>
>> It can work just fine, and I wrote it. The problem is convincing someone to
>> pull it :-( as the PR was closed and reopened with autodecoding put back in.
>
> Perhaps you need range2 and algorithm2 modules. Introducing your changes in a
> sneaky way may not produce well-working and predictable user code.

I'm not suggesting sneaky ways. setExt() was a NEW function.


>> I know that you care about performance - you post about it often. I would
>> expect that unnecessary and pervasive decoding would be of concern to you.
>
> I care first of all about program correctness (that's why I proposed unusual
> things like optional strong typing for built-in array indexes, or the
> "enum preconditions").

Ok, but you implied at one point that you were not aware of which parts of your string code decoded and which did not. That's not consistent with being very careful about correctness.

Note that autodecode does not always happen - it doesn't happen for ranges of chars. It's very hard to look at a piece of code and tell whether autodecode is going to happen or not.

> Secondly, I care about performance in the functions or parts
> of code where performance is needed. There is plenty of code where performance
> is not the most important thing; that's why I have tons of range-based code. In
> such large parts of the code, having short, nice-looking code that is correct
> is more important. Please don't assume I am simple-minded :-)

It's very hard to disable the autodecode when it is not needed, though the new .byCodeUnit has made that much easier.
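
A minimal sketch of the opt-out, assuming std.utf.byCodeUnit as it stands (it was brand new at the time); the two counts agree here only because 'l' is ASCII:

    import std.algorithm : count;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "héllo";
        // default: s auto-decodes, so count sees dchars (code points)
        assert(s.count('l') == 2);
        // opt out: count sees raw chars -- no decoding, no throw, no GC
        assert(s.byCodeUnit.count('l') == 2);
    }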

September 28, 2014
On 9/28/2014 1:39 PM, H. S. Teoh via Digitalmars-d wrote:
>> It can work just fine, and I wrote it. The problem is convincing
>> someone to pull it :-( as the PR was closed and reopened with
>> autodecoding put back in.
>
> The problem with pulling such PRs is that they introduce a dichotomy
> into Phobos. Some functions autodecode, some don't, and from a user's
> POV, it's completely arbitrary and random. Which leads to bugs because
> people can't possibly remember exactly which functions autodecode and
> which don't.

That's ALREADY the case, as I explained to bearophile.

The solution is not to have the ranges autodecode, but to have the ALGORITHMS decide to autodecode (if they need it) or not (if they don't).


>> As I've explained many times, very few string algorithms actually need
>> decoding at all. 'find', for example, does not. Trying to make a
>> separate universe out of autodecoding algorithms is missing the point.
> [...]
>
> Maybe what we need to do is change the implementation of
> std.algorithm so that it internally uses byCodeUnit for narrow strings
> where appropriate. We're already special-casing Phobos code for narrow
> strings anyway, so it wouldn't make things worse to have those special
> cases not autodecode.

Those special cases wind up going everywhere and impacting everyone who attempts to write generic algorithms.


> This doesn't quite solve the issue of composing ranges, since a
> range that returns dchar from .front, composed with another range, will
> have autodecoding built into it. For those cases, perhaps one way to
> hack around the present situation is to use Phobos-private enums in the
> wrapper ranges (e.g., enum isNarrowStringUnderneath = true; in struct
> Filter or something) that ranges downstream can test for, and do the
> appropriate bypasses.

More complexity :-( for what should be simple tasks.


> (BTW, before you pick on specific algorithms you might want to actually
> look at the code for things like find(), because I remember there were a
> couple o' PRs where find() on narrow strings uses (presumably) fast
> functions like strstr or strchr, bypassing a foreach loop over an
> autodecoding .front.)

Oh, I know that many algorithms have such specializations. Doesn't it strike you as sucky to have to special-case a whole basket of algorithms when the InputRange does not behave in a reliable manner?

It's very simple for an algorithm to decode if it needs to: it just adds a .byDchar adapter to its input range. Done. No special-casing needed. The lines of code written drop by half. And it works with arrays of chars, arrays of dchars, and input ranges of either.
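
A sketch of that shape (countNonAscii is a hypothetical algorithm; the one explicit decoding decision is the .byDchar line):

    import std.utf : byDchar;

    // Hypothetical algorithm that genuinely needs code points: it opts
    // into decoding itself with .byDchar, so it takes arrays of char,
    // arrays of dchar, or input ranges of either, with no special cases.
    size_t countNonAscii(R)(R r)
    {
        size_t n;
        foreach (c; r.byDchar)  // the one explicit decoding decision
            if (c >= 0x80)
                ++n;
        return n;
    }

    void main()
    {
        assert(countNonAscii("héllo") == 1);   // array of chars
        assert(countNonAscii("héllo"d) == 1);  // array of dchars
    }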

---

The stalling of setExt() has basically halted my attempts to adjust Phobos so that one can write nothrow and @nogc algorithms that work on strings.
September 28, 2014
On 9/28/2014 1:33 PM, Andrei Alexandrescu wrote:
> On 9/28/14, 11:36 AM, Walter Bright wrote:
>> Currently, the autodecoding functions allocate with the GC and throw as
>> well. (They'll GC allocate an exception and throw it if they encounter
>> an invalid UTF sequence. The adapters use the more common method of
>> inserting a substitution character and continuing on.) This makes it
>> harder to make GC-free Phobos code.
>
> The right solution here is refcounted exception plus policy-based functions in
> conjunction with RCString. I can't believe this focus has already been lost and
> we're back to let's remove autodecoding and ban exceptions. -- Andrei

Or setExt() can simply insert .byCodeUnit, as I suggested in the PR; then it's done, works correctly, doesn't throw, doesn't allocate, and goes fast.

Not everything in Phobos can be dealt with so easily, of course, but there's quite a bit of low-hanging fruit of this nature we can just take care of now.
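
A sketch of the flavor of that fruit (hasExt is a hypothetical helper, not the actual setExt() change, and it assumes attribute inference goes through):

    import std.algorithm : endsWith;
    import std.utf : byCodeUnit;

    // Comparing code units needs no decoding, so nothing in here
    // can throw or allocate.
    bool hasExt(string path, string ext) @nogc nothrow
    {
        return path.byCodeUnit.endsWith(ext.byCodeUnit);
    }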
September 28, 2014
On 9/28/2014 2:00 PM, Dmitry Olshansky wrote:
> I've already stated my perception of the "no stinking exceptions" and "no
> destructors 'cause I want it fast" elsewhere.
>
> Code must be correct and fast, with correct being a precondition for any
> performance tuning and speed hacks.

Sure. I'm not arguing for preferring incorrect code.


> Correct usually entails exceptions and automatic cleanup. I also do not believe
> the "exceptions have to be slow" motto; they are costly, but the proportion of
> such costs has been largely exaggerated.

I think it was you who suggested that, instead of throwing on invalid UTF, the replacement character be used? Or maybe not, I'm not quite sure.

Regardless, the replacement character method is widely used and accepted practice. There's no reason to throw.
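
A minimal sketch of that practice, assuming the byDchar adapter's default replacement-character behavior:

    import std.algorithm : equal;
    import std.utf : byDchar;

    void main()
    {
        // 0xFF can never begin a valid UTF-8 sequence; rather than
        // throwing, byDchar yields U+FFFD and carries on
        auto s = "ab\xFFcd";
        assert(s.byDchar.equal("ab\uFFFDcd"));
    }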

September 29, 2014
On Sun, 28 Sep 2014 19:44:39 +0000
Uranuz via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> I speak a language whose graphemes are encoded in 2 bytes
UCS-4? KOI8? my locale is KOI8, and i HATE D for assuming that everyone
on the planet is using UTF-8 and happy with it. from my POV, almost all
string decoding is broken. a string i got from the filesystem? good god, let
it not contain anything out of the ASCII range! a string i got from a text
file? the same. a string i must write to a text file or stdout? oh, c'mon,
what do you mean telling me "п©я─п╦п╡п╣я┌"?! i can't read that!


September 29, 2014
On 09/28/2014 02:23 AM, Andrei Alexandrescu wrote:
> front() should follow a simple pattern that's been very successful in
> HHVM: small inline function that covers most cases with "if (c < 0x80)"
> followed by an out-of-line function on the multicharacter case. That
> approach would make the cost of auto-decoding negligible in the
> overwhelming majority of cases.
>
> Andrei

Well, we've been using the same trick for 3 years already :).
https://github.com/D-Programming-Language/phobos/pull/299
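
A sketch of the shape of that trick (myFront and decodeSlow are hypothetical names; PR 299 is the real implementation):

    import std.utf : decode;

    // Tiny, inlinable fast path: ASCII needs no decoding.
    dchar myFront(const(char)[] s)
    {
        if (s[0] < 0x80)
            return s[0];       // single code unit, the common case
        return decodeSlow(s);  // multi-byte UTF-8, kept out of line
    }

    private dchar decodeSlow(const(char)[] s)
    {
        size_t i = 0;
        return decode(s, i);
    }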
September 29, 2014
On 09/28/2014 01:02 AM, Walter Bright wrote:
>
> It's the autodecode'ing front(), which is a fairly complex function.

At least for dmd it's caused by a long-standing compiler bug.
https://issues.dlang.org/show_bug.cgi?id=7625

https://github.com/D-Programming-Language/phobos/pull/2566
September 29, 2014
On 29-Sep-2014 03:48, Walter Bright wrote:
> On 9/28/2014 2:00 PM, Dmitry Olshansky wrote:
>> I've already stated my perception of the "no stinking exceptions" and
>> "no destructors 'cause I want it fast" elsewhere.
>>
>> Code must be correct and fast, with correct being a precondition for any
>> performance tuning and speed hacks.
>
> Sure. I'm not arguing for preferring incorrect code.
>
>
>> Correct usually entails exceptions and automatic cleanup. I also do not
>> believe the "exceptions have to be slow" motto; they are costly, but the
>> proportion of such costs has been largely exaggerated.
>
> I think it was you who suggested that, instead of throwing on invalid
> UTF, the replacement character be used? Or maybe not, I'm
> not quite sure.

Aye, that was me. I'd much prefer nothrow decoding. There should be an option to throw on bad input, though (and we have it already), for programs that do not expect to work with even partially broken input.

>
> Regardless, the replacement character method is widely used and accepted
> practice. There's no reason to throw.
>


-- 
Dmitry Olshansky
September 29, 2014
On Sunday, 28 September 2014 at 21:00:46 UTC, Dmitry Olshansky wrote:
> On 29-Sep-2014 00:33, Andrei Alexandrescu wrote:
>> The right solution here is refcounted exception plus policy-based
>> functions in conjunction with RCString. I can't believe this focus has
>> already been lost and we're back to let's remove autodecoding and ban
>> exceptions. -- Andrei

Consider what end users are going to use first, then design the library to fit the use cases while retaining general usefulness.

If UTF-8 decoding cannot be done efficiently as a generic adapter, then D's approach to generic programming is a failure: dead on arrival.

Generic programming is not supposed to be special-case-everything programming. If you cannot do generic programming well on strings, then don't. Provide a single, dedicated, concrete UTF-8 string type instead.
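
A purely hypothetical sketch of such a type's surface, reusing the adapters discussed in this thread as explicit views (the names units/points/graphemes are invented):

    // One concrete string type; iteration granularity is always an
    // explicit choice, never a decoding default.
    struct UTF8String
    {
        immutable(char)[] data;

        auto units()     { import std.utf : byCodeUnit;  return byCodeUnit(data); }
        auto points()    { import std.utf : byDchar;     return byDchar(data); }
        auto graphemes() { import std.uni : byGrapheme;  return byGrapheme(data); }
    }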

> I've already stated my perception of the "no stinking exceptions" and "no destructors 'cause I want it fast" elsewhere.
>
> Code must be correct and fast, with correct being a precondition for any performance tuning and speed hacks.
>
> Correct usually entails exceptions and automatic cleanup. I also do not believe the "exceptions have to be slow" motto; they are costly, but the proportion of such costs has been largely exaggerated.

Correctness has nothing to do with exceptions and an exception-specific cleanup model. It has to do with having a well-specified model of memory management, understanding the model, and implementing code to the model with rigour.

The alternative is to have isolates only, higher-level constructs only, GC everywhere, and uniform activation records for everything (no conceptual stack). Many high-level languages have worked this way, even in the '60s.

When it comes to exception efficiency you have many choices:

1. no exceptions, omit frame pointer

2. no extra overhead when not throwing, standard codegen and linking, slow unwind

3. no extra overhead when not throwing, nonstandard codegen, faster unwind

4. extra overhead when not throwing, nonstandard codegen, fast unwind

5. small extra overhead when not throwing, no RAII/single landing pad, frame pointer omission possible, very fast unwind (C-style longjmp)

6. hidden return error value, fixed medium overhead

D has selected the standard C++ model (2), which means low overhead when not throwing and the ability to use a regular C backend/linker. In a fully GC'd language I think that (5) is quite acceptable and what you usually want when writing a web service: you just bail out to the root, and you never really acquire resources except for transactions, which should be terminated in the handler before exiting (and they will time out anyway).

But there is no best model. There are trade-offs based on what kind of application you write.

So as usual it comes back to this: what kinds of applications is D actually targeting?

D is becoming less and less a system-level language, and more and more a compiled scripting framework.

The more special-casing there is, the less transparent D becomes, and the more of a scripting framework it becomes.

A good system-level language requires transparency:

- easy to visualize memory layout

- predictable compilation of code to machine language

- no fixed memory model

- no arbitrary limits and presumptions about execution model

- allows you to get close to the max hardware performance potential

I have a hard time picturing D as a system-level language these days. And the "hacks" that try to make it GC-free are not making it a better system-level language (I don't consider @nogc to be a hack); they make D even less transparent.

Despite all the flaws of the D1 compiler, D1 was fairly transparent.

You really need to decide whether D is supposed to be primarily a system-level programming language, or to provide system-level programming as an afterthought on top of an application-level programming language.

Currently it is the latter, and more so with every iteration.
September 29, 2014
Am Sun, 28 Sep 2014 12:38:25 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> I suggest that in the future you write code that is explicit about the intention - by character or by decoded character - by using the adapters .byChar or .byDchar.

... or by "user-perceived character" or by "word" or by
"line". I'm always on the fence with code points. Sure, they
are the code points, but what does that mean in practice?
Is it valid to start a Unicode string with just a diacritical
mark? Does it make sense to split in the middle of Korean
symbols, effectively removing parts of the glyphs and
rendering them invalid?
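
A concrete illustration of the code point/grapheme gap, assuming std.uni.byGrapheme:

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        // 'e' plus a combining acute accent: two code points,
        // but one user-perceived character
        auto s = "e\u0301";
        assert(s.walkLength == 2);             // code points (dchars)
        assert(s.byGrapheme.walkLength == 1);  // graphemes
    }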

Bearophile, what does your code _do_ with the dchar ranges? How is it not rendered into a caricature of its own attempts to support non-ASCII by the above?

-- 
Marco