May 13, 2016
On Fri, May 13, 2016 at 09:26:40PM +0200, Marco Leise via Digitalmars-d wrote:
> Am Fri, 13 May 2016 10:49:24 +0000
> schrieb Marc Schütz <schuetzm@gmx.net>:
> 
> > In fact, even most European languages are affected if NFD normalization is used, which is the default on MacOS X.
> > 
> > And this is actually the main problem with it: It was introduced to make unicode string handling correct. Well, it doesn't, therefore it has no justification.
> 
> +1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal.
> 
> You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode.
> 
> Can we handle real-world text AT ALL? Are graphemes good enough to find the column in a fixed-width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p
[...]

A simple lookup table ought to fix this. Preferably in std.uni so that it doesn't get reinvented by every other project.
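
For illustration, here is a rough sketch of the sort of lookup meant here. It is hypothetical, not an existing std.uni API, and the ranges below cover only a handful of common wide blocks; a real table would be generated from the Unicode East Asian Width data.

	import std.uni : isMark;

	/// Hypothetical display-width lookup (not an existing std.uni API).
	int displayWidth(dchar c)
	{
	    if (isMark(c)) return 0;            // combining marks take no column

	    if ((c >= 0x1100 && c <= 0x115F)    // Hangul Jamo (leading consonants)
	     || (c >= 0x2E80 && c <= 0x9FFF)    // CJK radicals .. unified ideographs
	     || (c >= 0xAC00 && c <= 0xD7A3)    // Hangul syllables
	     || (c >= 0xF900 && c <= 0xFAFF)    // CJK compatibility ideographs
	     || (c >= 0xFF01 && c <= 0xFF60))   // fullwidth forms
	        return 2;

	    return 1;
	}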


T

-- 
Don't modify spaghetti code unless you can eat the consequences.
May 15, 2016
On Sunday, 15 May 2016 at 01:45:25 UTC, Bill Hicks wrote:
> From a technical point of view, D is not successful, for the most part.
>  C/C++ at least can use the excuse that they were created during a time when we didn't have the experience and the knowledge that we do now.

Not really. The dominating precursor to C, BCPL, was a bootstrapping language for CPL. C was a quick hack to implement Unix. C++ has always been viewed as a hack and has been heavily criticised since its inception as an ugly, bastardized language that got many things wrong. The reality is that current mainstream programming languages draw on theory that has been well understood for 40+ years. There is virtually no innovation, but a lot of repeated mistakes.

Some esoteric languages draw on more modern concepts and innovate, but I can't think of a single mainstream language that does that.


> If by successful you mean the size of the user base, then D doesn't have that either. The number of D users is most definitely less than 10k. The number of people who have tried D is no doubt greater than that, but that's the thing with D: it has a low retention rate, for obvious reasons.

Yes, but D can make breaking changes, something C++ cannot do. Unfortunately there is no real willingness to clean up the language, so D is moving way too slowly to become competitive. But that is more of a cultural issue than a language issue.

I am personally increasingly involved with C++, but unfortunately there is no single C++ language. The C/C++ committees have tried to make the C family of languages more performant and higher level at the cost of correctness. So now you either have to do heavy code reviews or carefully select compiler options to get a sane C++ environment.

For example, in modern C/C++ the compiler assumes that there is no aliasing between pointers to different types. So if I cast a scalar float pointer to a SIMD pointer, I have to do one of the following:

1. Turn off that assumption with the compiler switch "-fno-strict-aliasing" and add "__restrict__" where I know there is no aliasing.

2. Put "__may_alias__" on my SIMD pointers.

3. Carefully place memory barriers between pointer type casts.

4. Dig into the compiler internals to figure out what it does.

C++ is trying way too hard to become a high-level language without the foundation to support it. This is an area where D could do well, but it isn't doing enough to get there, neither on the theoretical level nor on the implementation level.

Rust seems to try, but I don't think they will make it as they don't seem to have a broad view of programming. Maybe someone will build a new language over the Rust mid-level IR (MIR) that will be successful. I'm hopeful, but hey, it won't happen in less than 5 years.

Until then there are only three options for C++-ish programming: C++, D, and Loci. Currently C++ is the path of least resistance (but with a very high initial investment: 1+ year for an experienced, educated programmer).

So clearly a language comparable to D _could_ make headway, but not without a philosophical change that makes it a significant improvement over C++ and systematically addresses the C++ shortcomings one by one (while retaining the application area and basic programming model).

May 15, 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> > I am as unclear about the problems of autodecoding as I am about the
> > necessity to remove curl. Whenever I ask I hear some arguments that
> > work well emotionally but are scant on reason and engineering. Maybe
> > it's time to rehash them? I just did so about curl, no solid argument
> > seemed to come together. I'd be curious of a crisp list of grievances
> > about autodecoding. -- Andrei
>

Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.

Measurements were done using an artificial scenario: counting lower-case ASCII letters. This had the effect of calling front/popFront many times on a long block of text. Runs were done treating the text both as char[] and as ubyte[], and comparing the run times. (char[] performs auto-decoding, ubyte[] does not.)
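
A minimal sketch of the kind of loop being timed (the actual test program is linked under Details below; the function here is only illustrative):

	import std.range.primitives : empty, front, popFront;

	// For char[], front auto-decodes to dchar; for ubyte[], it returns the
	// raw byte. The ASCII comparison works either way.
	size_t countLowerAscii(R)(R text)
	{
	    size_t count = 0;
	    for (auto r = text; !r.empty; r.popFront())
	    {
	        auto c = r.front;
	        if (c >= 'a' && c <= 'z') ++count;
	    }
	    return count;
	}

	// auto decoded   = countLowerAscii(cast(char[]) data);   // auto-decoding
	// auto undecoded = countLowerAscii(cast(ubyte[]) data);  // raw code units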

Timings were done with DMD and LDC on two different data sets. One data set was a mix of Latin-script languages (e.g. German, English, Finnish), the other of non-Latin languages (e.g. Japanese, Chinese, Greek). The goal was to distinguish between scenarios with high and low ASCII character content.

The result: For DMD, auto-decoding showed a 1.6x to 2.6x cost. For LDC, a 12.2x to 12.9x cost.

Details:
- Test program: https://dpaste.dzfl.pl/67c7be11301f
- DMD 2.071.0. Options: -release -O -boundscheck=off -inline
- LDC 1.0.0-beta1 (based on DMD v2.070.2). Options: -release -O -boundscheck=off
- Machine: Macbook Pro (2.8 GHz Intel I7, 16GB ram)

Runs for each combination were done five times and the median times used. The median times and the char[] to ubyte[] ratio are below:
| Compiler | Text type | char[] time (ms) | ubyte[] time (ms) | ratio |
|----------+-----------+------------------+-------------------+-------|
| DMD      | Latin     |             7261 |              4513 |   1.6 |
| DMD      | Non-latin |            10240 |              3928 |   2.6 |
| LDC      | Latin     |            11773 |               913 |  12.9 |
| LDC      | Non-latin |            10756 |               883 |  12.2 |

Note: The numbers above don't provide enough info to derive a front/popFront rate. The program artificially makes multiple loops to increase the run-times. (For these runs, the program's repeat-count was set to 20).

Characteristics of the two data sets:
| Text type |   Bytes |  DChars | ASCII chars | Bytes per DChar | Pct ASCII |
|-----------+---------+---------+-------------+-----------------+-----------|
| Latin     | 4156697 | 4059016 |     3965585 |           1.024 |     97.7% |
| Non-latin | 4061554 | 1949290 |      348164 |           2.084 |     17.9% |

Run-to-run variability - The run times recorded were quite stable. The largest delta between minimum and median time for any group was 17 milliseconds.

May 16, 2016
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
> Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.

Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference: http://forum.dlang.org/post/ndj6dm$a6c$1@digitalmars.com

The result is a 756% slowdown.
May 15, 2016
On Mon, May 16, 2016 at 12:31:04AM +0000, Jack Stouffer via Digitalmars-d wrote:
> On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
> >Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.
> 
> Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference: http://forum.dlang.org/post/ndj6dm$a6c$1@digitalmars.com
> 
> The result is a 756% slowdown.

I decided to do my own benchmarking too. Here's the code:

	/**
	 * Simple-minded benchmark for measuring performance degradation caused by
	 * autodecoding.
	 */

	import std.typecons : Flag, Yes, No;

	size_t countNewlines(Flag!"autodecode" autodecode)(const(char)[] input)
	{
	    size_t count = 0;

	    static if (autodecode)
	    {
	        import std.array;
	        foreach (dchar ch; input)
	        {
	            if (ch == '\n') count++;
	        }
	    }
	    else // !autodecode
	    {
	        import std.utf : byCodeUnit;
	        foreach (char ch; input.byCodeUnit)
	        {
	            if (ch == '\n') count++;
	        }
	    }
	    return count;
	}

	void main(string[] args)
	{
	    import std.datetime : benchmark;
	    import std.file : read;
	    import std.stdio : writeln, writefln;

	    string input = (args.length >= 2) ? args[1] : "/usr/src/d/phobos/std/datetime.d";

	    uint n = 50;
	    auto data = cast(char[]) read(input);
	    writefln("Input: %s (%d bytes)", input, data.length);
	    size_t count;

	    writeln("With autodecoding:");
	    auto result = benchmark!({
	        count = countNewlines!(Yes.autodecode)(data);
	    })(n);
	    writefln("Newlines: %d  Time: %s msecs", count, result[0].msecs);

	    writeln("Without autodecoding:");
	    result = benchmark!({
	        count = countNewlines!(No.autodecode)(data);
	    })(n);
	    writefln("Newlines: %d  Time: %s msecs", count, result[0].msecs);
	}

	// vim:set sw=4 ts=4 et:

Just for fun, I decided to use std/datetime.d, one of the largest modules in Phobos, as a test case.

For comparison, I compiled with dmd (latest git head) and gdc 5.3.1. The
compile commands were:

	dmd -O -inline bench.d -ofbench.dmd
	gdc -O3 bench.d -o bench.gdc

Here are the results from bench.dmd:

	Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes)
	With autodecoding:
	Newlines: 35398  Time: 331 msecs
	Without autodecoding:
	Newlines: 35398  Time: 254 msecs

And the results from bench.gdc:

	Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes)
	With autodecoding:
	Newlines: 35398  Time: 253 msecs
	Without autodecoding:
	Newlines: 35398  Time: 25 msecs

These results are pretty typical across multiple runs. There is a variance of about 20 msecs or so between bench.dmd runs, but the bench.gdc runs vary only by about 1-2 msecs.

So for bench.dmd, autodecoding adds about a 30% overhead to running time, whereas for bench.gdc, autodecoding costs an order of magnitude increase in running time.

As an interesting aside, compiling with dmd without -O -inline causes the non-autodecoding case to be consistently *slower* than the autodecoding case. Apparently the performance is then dominated by the cost of calling non-inlined range primitives on byCodeUnit, whereas a manual for-loop over the array of chars produces results similar to the -O -inline case. I find this interesting because it shows that the cost of autodecoding is relatively small compared to the cost of unoptimized range primitives. Nevertheless, autodecoding does make a big difference when range primitives are properly optimized. It is especially telling in the case of gdc that, given a superior optimizer, the non-autodecoding case can be made an order of magnitude faster, whereas the autodecoding case is presumably complex enough to defeat the optimizer.
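
For reference, the "manual for-loop over the array of chars" mentioned above is simply something like the following (a sketch, not part of the benchmark as posted):

	// Indexing the char array directly avoids both decoding and any
	// range-primitive call overhead, so it stays fast even without -O -inline.
	size_t countNewlinesManual(const(char)[] input)
	{
	    size_t count = 0;
	    foreach (i; 0 .. input.length)
	    {
	        if (input[i] == '\n') count++;
	    }
	    return count;
	}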


T

-- 
Democracy: The triumph of popularity over principle. -- C.Bond
May 16, 2016
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
>
> Runs for each combination were done five times and the median times used. The median times and the char[] to ubyte[] ratio are below:
> | Compiler | Text type | char[] time (ms) | ubyte[] time (ms) | ratio |
> |----------+-----------+------------------+-------------------+-------|
> | DMD      | Latin     |             7261 |              4513 |   1.6 |
> | DMD      | Non-latin |            10240 |              3928 |   2.6 |
> | LDC      | Latin     |            11773 |               913 |  12.9 |
> | LDC      | Non-latin |            10756 |               883 |  12.2 |
>

Interesting that LDC is slower than DMD for char[].
May 17, 2016
On Friday, 13 May 2016 at 21:46:28 UTC, Jonathan M Davis wrote:
> The history of why UTF-16 was chosen isn't really relevant to my point (Win32 has the same problem as Java and for similar reasons).
>
> My point was that if you use UTF-8, then it's obvious _really_ fast when you screwed up Unicode-handling by treating a code unit as a character, because anything beyond ASCII is going to fall flat on its face.

On the other hand, if you deal with UTF-16 text, you can't interpret it in any way other than UTF-16: people either get it correct or give up, even for ASCII, even with casts; it's that resilient. With UTF-8, problems happened on a massive scale in LAMP setups: MySQL used latin1 as the default encoding and almost everything worked fine.
May 17, 2016
On Tuesday, 17 May 2016 at 09:53:17 UTC, Kagamin wrote:
> With UTF-8, problems happened on a massive scale in LAMP setups: MySQL used latin1 as the default encoding and almost everything worked fine.

^ latin-1 with Swedish collation rules.
And even if you set the encoding to "utf8", almost everything works fine until you discover that you need to set the encoding to "utf8mb4" to get real utf8.  Also, MySQL has per-connection character encoding settings, so even if your application is properly set up to use utf8, you can break things by accidentally connecting with a client using the default pretty-much-latin1 encoding.  With MySQL's "silently ram the square peg into the round hole" design philosophy, this can cause data corruption.

But, of course, almost everything works fine.

Just some examples of why broken utf8 exists (and some venting of MySQL trauma).
May 26, 2016
This might be a good time to discuss this a tad further. I'd appreciate it if the debate stayed on point going forward. Thanks!

My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision. The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.

On 05/12/2016 04:15 PM, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> > I am as unclear about the problems of autodecoding as I am about the
> > necessity to remove curl. Whenever I ask I hear some arguments that
> > work well emotionally but are scant on reason and engineering. Maybe
> > it's time to rehash them? I just did so about curl, no solid argument
> > seemed to come together. I'd be curious of a crisp list of grievances
> > about autodecoding. -- Andrei
>
> Here are some that are not matters of opinion.
>
> 1. Ranges of characters do not autodecode, but arrays of characters do.
> This is a glaring inconsistency.

Agreed. At the point of that decision, the party line was "arrays of characters are strings, nothing else is or should be". Now it is apparent that shouldn't have been the case.

> 2. Every time one wants an algorithm to work with both strings and
> ranges, you wind up special casing the strings to defeat the
> autodecoding, or to decode the ranges. Having to constantly special case
> it makes for more special cases when plugging together components. These
> issues often escape detection when unittesting because it is convenient
> to unittest only with arrays.

This is a consequence of 1. It is at least partially fixable.

> 3. Wrapping an array in a struct with an alias this to an array turns
> off autodecoding, another special case.

This is also a consequence of 1.

> 4. Autodecoding is slow and has no place in high speed string processing.

I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.

Also allow me to point out that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and should arrange code to minimize the impact.
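
To make that concrete, here is a sketch of the kind of tactical fast path meant above (hypothetical code, not the actual Phobos implementation; it assumes a non-empty, well-formed input):

	import std.utf : decodeFront;

	// In ASCII-heavy text the first branch is taken almost always and is
	// easily predicted, so the full decode is paid only on the rare
	// multi-byte sequences.
	dchar firstCodePoint(const(char)[] s)
	{
	    if (s[0] < 0x80)
	        return s[0];          // ASCII fast path: no decoding
	    auto tmp = s;
	    return decodeFront(tmp);  // slow path: full UTF-8 decode
	}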

> 5. Very few algorithms require decoding.

The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
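
As a concrete illustration of the three levels involved (a small sketch; the combining form is spelled out explicitly so the counts are unambiguous):

	import std.range : walkLength;
	import std.uni : byGrapheme;

	void main()
	{
	    string s = "noe\u0308l";                // "noël" with combining diaeresis
	    assert(s.length == 6);                  // UTF-8 code units
	    assert(s.walkLength == 5);              // code points (auto-decoded)
	    assert(s.byGrapheme.walkLength == 4);   // user-perceived characters
	}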

> 6. Autodecoding has two choices when encountering invalid code units -
> throw or produce an error dchar. Currently, it throws, meaning no
> algorithms using autodecode can be made nothrow.

Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.

> 7. Autodecode cannot be used with unicode path/filenames, because it is
> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
> out in the wild that pure Unicode is not universal - there's lots of
> dirty Unicode that should remain unmolested, and autodecode does not play
> well with that.

If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.

> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
> considerably extra work every time. A convenient timesaver it ain't.

Objection. Vague.

> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> importing std.array one way or another, and then autodecode is there.

Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
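
For example (a small sketch; .representation comes from std.string and yields the code units without any decoding):

	import std.string : representation;

	unittest
	{
	    string s = "Schütz";
	    auto units = s.representation;      // no autodecoding from here on
	    static assert(is(typeof(units) == immutable(ubyte)[]));
	    assert(units.length == 7);          // 'ü' is two UTF-8 code units
	}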

> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
> benefit of being arrays in the first place.

First off, you always have the option of .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.

Second, it's as it should be. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by holding UTF-8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[], and why do we need char as a separate type from ubyte?

This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose, subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce this, except that it does special things in random places, such as foreach loops (any others?).

If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF-8 code unit.

> 11. Indexing an array produces different results than autodecoding,
> another glaring special case.

This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.

Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.


Andrei

May 26, 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
> instead, it should use standard library algorithms for searching,
> matching etc. When needed, iterating every code unit is trivially
> done through indexing.

For an example where the std.algorithm/range functions don't cut it: my random-format date-string parser first breaks the given character range up into tokens. Once it has the tokens, it checks several known formats. One piece of that is checking whether some of the tokens are in AAs of month and day names, as a fast test of presence. Because the AAs are int[string], and the encoding of the incoming string can't be relied upon (it's complicated), during tokenization the character range must be forced to UTF-8 with byChar for all isSomeString!R == true inputs, to avoid auto-decoding and the subsequent AA key mismatch.
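
A stripped-down sketch of that pattern (the names are hypothetical; the real parser is more involved): the token is pinned to UTF-8 code units with byChar before the AA key is built, so the lookup never goes through decoded dchars.

	import std.array : array;
	import std.utf : byChar;

	int[string] monthNames;   // e.g. ["jan" : 1, "feb" : 2, ...]

	// Force the token down to UTF-8 code units before building the AA key,
	// so no auto-decoding happens along the way and the key bytes match.
	bool isMonthToken(R)(R token)
	{
	    string key = token.byChar.array.idup;
	    return (key in monthNames) !is null;
	}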

> Agreed. This is probably the most glaring mistake. I think we should open a discussion no fixing this everywhere in the stdlib, even at the cost of breaking code.

See the discussion here: https://issues.dlang.org/show_bug.cgi?id=14519

I think some of the proposals there are interesting.

> Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.

If you agree that iterating over code units and code points isn't what people want/need most of the time, then I will quote something from my article on the subject:

"I really don't see the benefit of the automatic behavior fulfilling this one specific corner case when you're going to make everyone else call a range generating function when they want to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration and save a lot of people the trouble!"

I think the only clear way forward is to not make strings ranges and force people to make a decision when passing them to range functions. The HUGE problem is the code this will break, which is just about all of it.