September 29, 2014
On Sun, 28 Sep 2014 16:21:14 -0700,
Walter Bright <newshound2@digitalmars.com> wrote:

> On 9/28/2014 1:39 PM, H. S. Teoh via Digitalmars-d wrote:
> > The problem with pulling such PRs is that they introduce a dichotomy into Phobos. Some functions autodecode, some don't, and from a user's POV, it's completely arbitrary and random. Which leads to bugs because people can't possibly remember exactly which functions autodecode and which don't.
> 
> That's ALREADY the case, as I explained to bearophile.
> 
> The solution is not to have the ranges autodecode, but to have the ALGORITHMS decide to autodecode (if they need it) or not (if they don't).

Yes, that sounds like the right abstraction!
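If I understand the proposal right, it would look roughly like this minimal sketch (assuming the .byDchar adapter Walter mentions lands in std.utf; the function names here are made up):

import std.utf : byDchar;  // the lazy decoding adapter Walter describes

// Hypothetical algorithm that genuinely needs code points: it opts in.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (c; s.byDchar)  // decoding happens here, because this algorithm asked for it
        ++n;
    return n;
}

// Hypothetical algorithm that doesn't care about code points: no decoding at all.
size_t countCodeUnits(const(char)[] s)
{
    return s.length;
}

void main()
{
    assert(countCodePoints("año") == 3);  // 3 code points
    assert(countCodeUnits("año") == 4);   // 4 code units: 'ñ' is two UTF-8 bytes
}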

-- 
Marco

September 29, 2014
I refuse to accept any codegen complaints based on DMD. Its optimization facilities are generally crappy compared to gdc/ldc and not worth caring about - it is just a reference implementation, after all. Clean and concise library code is more important.

Now, if the same inlining failure happens with the other two compilers - that is something worth talking about (I don't know whether it does).
September 29, 2014
On Sunday, 28 September 2014 at 23:21:15 UTC, Walter Bright wrote:
> On 9/28/2014 1:39 PM, H. S. Teoh via Digitalmars-d wrote:
>>> It can work just fine, and I wrote it. The problem is convincing
>>> someone to pull it :-( as the PR was closed and reopened with
>>> autodecoding put back in.
>>
>> The problem with pulling such PRs is that they introduce a dichotomy
>> into Phobos. Some functions autodecode, some don't, and from a user's
>> POV, it's completely arbitrary and random. Which leads to bugs because
>> people can't possibly remember exactly which functions autodecode and
>> which don't.
>
> That's ALREADY the case, as I explained to bearophile.
>
> The solution is not to have the ranges autodecode, but to have the ALGORITHMS decide to autodecode (if they need it) or not (if they don't).

No, it isn't, despite you pretending otherwise. Right now there is a simple rule - Phobos auto-decodes everywhere, and any failure to do so is considered a bug. Sometimes it is possible to bypass decoding for a speed-up while preserving semantic correctness, but that is always an implementation detail; from the point of view of the API it can't be noticed (for valid Unicode strings, at least).
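To make concrete what "implementation detail" means here: a sketch (my own, not actual Phobos code) of a find specialized for an ASCII needle. It never decodes, yet for valid UTF-8 it returns exactly what the decoding version would, so nothing leaks through the API:

// Sketch: find the first occurrence of an ASCII char in a UTF-8 string.
// In valid UTF-8, bytes below 0x80 never appear inside a multi-byte
// sequence, so a plain code-unit scan gives the same answer as decoding
// every code point would.
inout(char)[] findAscii(inout(char)[] haystack, char needle)
{
    assert(needle < 0x80, "the shortcut is only valid for ASCII needles");
    foreach (i, c; haystack)        // iterates code units, no decoding
        if (c == needle)
            return haystack[i .. $];
    return haystack[$ .. $];
}

unittest
{
    assert(findAscii("добрый день", ' ') == " день");
    assert(findAscii("abc", 'x').length == 0);
}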

Your proposal would have set a precedent for adding an _intentional_ exception. That is unacceptable.
September 29, 2014
On Sun, 28 Sep 2014 16:48:53 -0700,
Walter Bright <newshound2@digitalmars.com> wrote:

> Regardless, the replacement character method is widely used and accepted practice. There's no reason to throw.

I feel a bit uneasy about this. Could it introduce a silent loss of information? While the replacement character method is widely used, so is the error method. APIs typically provide flags for this; a small sketch contrasting the two policies follows the list below.

MultiByteToWideChar: The flag MB_ERR_INVALID_CHARS decides
     whether the API errors out or drops invalid chars.
ICU: You set up an "error callback". The default replaces
     invalid characters with the Unicode substitution
     character. (We are talking about conversions from
     arbitrary charsets, like the Amiga ones, to Unicode.)
     Other prefab error handlers drop the invalid character or
     error out.
iconv: By default it errors out at the location where an
     incomplete or invalid sequence is detected. With the
     "//IGNORE" flag, it will silently drop invalid characters.

I'm not opposed to the decision, but I expected the reasoning
to be more along the lines of:
  `string` is by definition correct UTF-8. Exception or
  substitution character is of no concern to a correctly
  written D program, because decoding errors won't happen.
  Validate and treat all input as ubyte[]. (Especially when
  coming from a Windows console.)
or:
  We may lose information in the conversion, but it's the only
  practical way to reach the @nogc goal. And we are far from
  having reference-counted Exceptions.
instead of:
  Many people use the substitution character [in unspecified
  context], so it follows that it can replace Exceptions for
  Phobos' string-dchar decoding. :)

-- 
Marco

September 29, 2014
On Sunday, 28 September 2014 at 23:06:28 UTC, Walter Bright wrote:
> Note that autodecode does not always happen - it doesn't happen for ranges of chars. It's very hard to look at a piece of code and tell if autodecode is going to happen or not.

Arguably, this means we need to unify the behavior of strings and "string-like" objects. Pointing to an inconsistency doesn't mean the design is flawed and void.
September 29, 2014
On Sunday, 28 September 2014 at 23:21:15 UTC, Walter Bright wrote:
> It's very simple for an algorithm to decode if it needs to, it just adds in a .byDchar adapter to its input range. Done. No special casing needed. The lines of code written drop in half. And it works with both arrays of chars, arrays of dchars, and input ranges of either.

This just misses the *entire* family of algorithms that operate on generic types, such as "map" - e.g. the totality of std.algorithm. Oops.
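To make it concrete (my own sketch, assuming the new byCodeUnit): map just consumes whatever range primitives it is handed, and for a plain string those primitives autodecode. map itself has no way to "decide" anything:

import std.algorithm : equal, map;
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "weiß";

    // For a plain string the range primitives autodecode, so map sees dchars:
    assert(s.map!(c => c).equal("weiß"d));
    assert(s.walkLength == 4);             // 4 code points

    // The caller has to opt out explicitly; map cannot insert .byDchar
    // or .byCodeUnit itself, because it has no idea its input is a string.
    assert(s.byCodeUnit.walkLength == 5);  // 5 code units: 'ß' is two bytes
}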
September 29, 2014
On Sunday, 28 September 2014 at 23:48:54 UTC, Walter Bright wrote:
> I think it was you that suggested that instead of throwing on invalid UTF, that the replacement character be used instead? Or maybe not, I'm not quite sure.
>
> Regardless, the replacement character method is widely used and accepted practice. There's no reason to throw.

This I'm OK to stand behind as an acceptable change (should we decide to go with it). It will kill the "auto-decode throws and uses the GC" argument.
September 29, 2014
On Sunday, 28 September 2014 at 23:06:28 UTC, Walter Bright wrote:
> It's very hard to disable the autodecode when it is not needed, though the new .byCodeUnit has made that much easier.

One issue with this though is that "byCodeUnit" is not actually an array. As such, by using "byCodeUnit", you have just as much chance of improving performance as you have of *hurting* performance for algorithms that are string-optimized.

For example, which would be fastest:
"hello world".find(' '); //(1)
"hello world".byCodeUnit.find(' '); //(2)

Currently, (1) is faster :/

This is a good argument though to instead use ubyte[] or std.encoding.AsciiString.

What I think we (maybe) need though is std.encoding.UTF8Array, which would explicitly mean: this is a range containing UTF-8 data; I don't want decoding. It's an array you may run memchr over or slice.
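Purely hypothetical, since no such type exists anywhere today, but roughly what I picture it looking like (my sketch; names and layout are made up):

// Hypothetical "UTF8Array": a range that says "these are UTF-8 code
// units, do NOT decode me", while keeping array niceties like slicing.
struct UTF8Array
{
    immutable(ubyte)[] data;

    // Random-access range over code units -- no decoding anywhere.
    @property bool empty() { return data.length == 0; }
    @property ubyte front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property UTF8Array save() { return this; }
    @property size_t length() { return data.length; }
    ubyte opIndex(size_t i) { return data[i]; }
    UTF8Array opSlice(size_t lo, size_t hi) { return UTF8Array(data[lo .. hi]); }
}

UTF8Array utf8Array(string s)
{
    return UTF8Array(cast(immutable(ubyte)[]) s);
}

unittest
{
    auto a = utf8Array("hello world");
    assert(a.length == 11);
    assert(a[5] == ' ');            // plain byte-level access; memchr-style search is fine
    assert(a[0 .. 5].length == 5);
}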
September 29, 2014
On Mon, 29 Sep 2014 04:33:08 +0300,
ketmar via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> On Sun, 28 Sep 2014 19:44:39 +0000
> Uranuz via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> 
> > I speak a language whose graphemes are coded with 2 bytes
> UCS-4? KOI8? my locale is KOI8, and i HATE D for assuming that everyone
> on the planet is using UTF-8 and happy with it. from my POV, almost all
> string decoding is broken. a string i got from the filesystem? good god,
> let's hope it doesn't contain anything outside the ASCII range! a string
> i got from a text file? the same. a string i must write to a text file
> or stdout? oh, c'mon, what do you mean telling me "п©я─п╦п╡п╣я┌"?! i
> can't read that!

My friend, we agree here! We must convert the whole world to
UTF-8 eventually to end this madness! But for now, when we
write to a terminal, we have to convert to the system locale,
because there are still people who don't use Unicode. (On
Windows consoles the wide-char writing functions are good
enough for NFC strings.)
And a path from the filesystem is actually in no specific
encoding on Unix. We only know it is byte-based and uses the
ASCII '/' and '\0' as delimiters. On Windows it is ushort-based,
IIRC. To make matters messier, Gtk assumes Unicode, while Qt
assumes the user's locale for file names. And in reality it is
determined by the IO charset at mount time.

-- 
Marco


September 29, 2014
On Sun, 28 Sep 2014 11:08:13 -0700,
Walter Bright <newshound2@digitalmars.com> wrote:

> On 9/28/2014 5:06 AM, Uranuz wrote:
> > A question: can you list some languages that represent UTF-8 narrow strings as array of single bytes?
> 
> C and C++.

Not really; C strings are in the "C locale", not a specific encoding. I.e. they _cannot_ deal with UTF-8 specifically. Assuming UTF-8 would lead to funny output on consoles and the like when passing D strings to C functions. :)

-- 
Marco