February 08, 2014
On Fri, 07 Feb 2014 22:42:00 -0500,
"Jonathan M Davis" <jmdavisProg@gmx.com> wrote:

> On Saturday, February 08, 2014 02:41:54 bearophile wrote:
> > Jonathan M Davis:
> > > The problem is that you need to check it. This is _slower_ than exceptions in the normal case,
> > 
> > Right, but verifying the correctness of the Unicode encoding of a string probably requires, on average, much more time than testing a single conditional. So I think this tiny added time is acceptable.
> 
> But why even do it in the first place then? The code is cleaner and less error-prone if it uses exceptions. The only argument I can see being made for not using exceptions with decode is efficiency, because it's more cumbersome to use if it's returning error values of some kind rather than just throwing in the rare case that there's a Unicode decoding error. It's also more error-prone than using exceptions, because most code will just skip checking the result. That's one of the big reasons that error codes are generally a bad idea.
> 
> But since decode has to do the same validity checks whether it returns an invalid dchar or a Nullable!dchar or if it throws, I don't see why not having the exception buys us anything. It just makes the API worse.
> 
> - Jonathan M Davis

I agree with both of you. The Unicode standard tells us that
it is correct to replace invalid data with that special code
point, so it should be used where applicable, e.g. when one
sanitizes an invalid string.
On the other hand exceptions are clearly superior to error
returns.

I guess we just have two use cases here. One where invalid
encoding is not an error (e.g. for sanitizing purposes) and
one where you don't want to lose information and have to
enforce correct encoding.
Name the first one "decodeSubst" maybe and have decode call
that and check for 0xFFFD?

-- 
Marco

February 08, 2014
On Sat, 8 Feb 2014 05:29:35 +0100,
Marco Leise <Marco.Leise@gmx.de> wrote:

> Name the first one "decodeSubst" maybe and have decode call that and check for 0xFFFD?

Err... the other way round. 0xFFFD would actually be valid
from an encoding point of view, I guess.
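
A minimal sketch of that corrected arrangement, assuming a helper named decodeSubst as floated above. Neither decodeSubst nor the replacement constant is an existing Phobos symbol; both are hypothetical. The throwing std.utf.decode stays the primitive, and the substituting variant wraps it. Skipping exactly one code unit after a failure is one possible resynchronisation policy, not the only one.

```d
import std.utf : decode, UTFException;

/// U+FFFD REPLACEMENT CHARACTER (local constant, defined here for illustration).
enum dchar replacement = '\uFFFD';

/// Hypothetical helper, not part of Phobos: decodes one code point and
/// substitutes U+FFFD instead of letting UTFException escape.
dchar decodeSubst(string str, ref size_t index)
{
    immutable start = index;
    try
    {
        return decode(str, index);
    }
    catch (UTFException)
    {
        // Assumed policy: skip a single code unit and carry on.
        index = start + 1;
        return replacement;
    }
}

unittest
{
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62]; // 'a', an invalid byte, 'b'
    auto s = cast(string) raw;

    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decodeSubst(s, i);

    assert(result == "a\uFFFDb"d);
}
```

Going the other way (decode calling decodeSubst and treating U+FFFD as an error) would reject valid input that already contains U+FFFD.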

-- 
Marco

February 08, 2014
On Saturday, February 08, 2014 05:29:35 Marco Leise wrote:
> I guess we just have two use cases here. One where invalid
> encoding is not an error (e.g. for sanitizing purposes) and
> one where you don't want to lose information and have to
> enforce correct encoding.
> Name the first one "decodeSubst" maybe and have decode call
> that and check for 0xFFFD?

I think that that would call for us to have 3 related but distinct functions:

1. decode, which throws on invalid Unicode. We already have this.

2. isValidUnicode, which returns whether the string is valid Unicode and does not throw. We don't yet have this. Rather, we have validate which does the same job and then throws instead of returning bool.

3. sanitizeUnicode (or whatever would be a good name for it), which replaces invalid Unicode with 0xFFFD (or whatever the appropriate character is) so that it can be operated on without causing decode to throw in spite of the fact that it was invalid Unicode. We don't have anything like this yet.
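
As a rough sketch of item 2, assuming it is built on top of the existing std.utf.validate: isValidUnicode is the hypothetical name from the list above, not an existing Phobos function, and a real implementation would presumably avoid the exception machinery rather than catch and discard it.

```d
import std.utf : validate, UTFException;

/// Hypothetical function, not part of Phobos: reports validity as a bool
/// instead of throwing.
bool isValidUnicode(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62]; // 'a', an invalid byte, 'b'

    assert(isValidUnicode("hello"));
    assert(!isValidUnicode(cast(string) raw));
}
```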

- Jonathan M Davis
February 08, 2014
On Friday, February 07, 2014 21:04:08 Jonathan M Davis wrote:
> On Saturday, February 08, 2014 05:29:35 Marco Leise wrote:
> > I guess we just have two use cases here. One where invalid
> > encoding is not an error (e.g. for sanitizing purposes) and
> > one where you don't want to lose information and have to
> > enforce correct encoding.
> > Name the first one "decodeSubst" maybe and have decode call
> > that and check for 0xFFFD?
> 
> I think that that would call for us to have 3 related but distinct functions:
> 
> 1. decode, which throws on invalid Unicode. We already have this.
> 
> 2. isValidUnicode, which returns whether the string is valid Unicode and does not throw. We don't yet have this. Rather, we have validate which does the same job and then throws instead of returning bool.
> 
> 3. sanitizeUnicode (or whatever would be a good name for it), which replaces invalid Unicode with 0xFFFD (or whatever the appropriate character is) so that it can be operated on without causing decode to throw in spite of the fact that it was invalid Unicode. We don't have anything like this yet.

Actually, thinking this through some more, if we can replace invalid Unicode with 0xFFFD, and have all algorithms work with that and consider it valid Unicode (rather than getting weird bugs due to invalid Unicode), then if decode returned that on error rather than throwing, we wouldn't actually need to check the return value. It wouldn't matter that the Unicode was invalid. So, we wouldn't even need to _care_ that the Unicode was invalid. Anyone who _did_ care could call isValidUnicode to validate the Unicode first, and those who didn't wouldn't need to worry about UTFException being thrown, because everything would still work even if the string was invalid Unicode.

So, if that's indeed what 0xFFFD does, and that's what Dmitry meant by proposing that we return that rather than throwing, then I rescind my assessment that throwing was the best way to go and have to agree that returning 0xFFFD would be better. I was responding under the assumption that you had to check for 0xFFFD and respond to it in order to avoid having your code be buggy, in which case throwing would be far better. But if 0xFFFD is considered valid Unicode, then returning that would be a fantastic solution. And if that's the case, we only need two functions, not three:

1. decode, which returns 0xFFFD on decode failure

2. isValidUnicode, which returns whether the string is valid

And I actually really like the idea that we could just operate on invalid Unicode as valid Unicode this way, making it so that most code doesn't need to care, and code that _does_ need to care, can validate the strings first. Right now, pretty much all string code needs to care in order to avoid processing invalid Unicode, which is much messier.
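
On the question of whether 0xFFFD counts as valid Unicode: U+FFFD is an ordinary assigned code point, so a string containing it passes the existing checks. The short check below is only an illustration using std.utf.validate and std.utf.decode as they stand today; it says nothing about how a non-throwing decode would be implemented.

```d
import std.utf : decode, validate;

unittest
{
    string s = "a\uFFFDb";

    // Does not throw: U+FFFD is well-formed UTF-8 (three code units).
    validate(s);

    // Decoding it is unremarkable as well.
    size_t i = 1;
    assert(decode(s, i) == '\uFFFD');
    assert(i == 4);
}
```

This also means that a caller cannot tell a substituted U+FFFD from one that was present in the original input, which is why code that cares has to validate up front.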

- Jonathan M Davis
February 08, 2014
On Fri, 07 Feb 2014 21:04:08 -0800,
Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> On Saturday, February 08, 2014 05:29:35 Marco Leise wrote:
> > I guess we just have two use cases here. One where invalid
> > encoding is not an error (e.g. for sanitizing purposes) and
> > one where you don't want to lose information and have to
> > enforce correct encoding.
> > Name the first one "decodeSubst" maybe and have decode call
> > that and check for 0xFFFD?
> 
> I think that that would call for us to have 3 related but distinct functions:
> 
> 1. decode, which throws on invalid Unicode. We already have this.
>
> 2. isValidUnicode, which returns whether the string is valid Unicode and does not throw. We don't yet have this. Rather, we have validate which does the same job and then throws instead of returning bool.

Yes, that's the one that needs to be added.

> 3. sanitizeUnicode (or whatever would be a good name for it), which replaces invalid Unicode with 0xFFFD (or whatever the appropriate character is) so that it can be operated on without causing decode to throw in spite of the fact that it was invalid Unicode. We don't have anything like this yet.

And oh wonder, we actually have that already! Problem solved:
http://dlang.org/phobos/std_encoding.html#.sanitize
(Not that I knew that beforehand *cough*)

Or does someone have a need to also sanitize code point by code point?
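
For reference, a small usage sketch of the whole-string route with std.encoding.sanitize and std.encoding.isValid. The input bytes are made up for illustration, and the sketch deliberately asserts only that the result is valid and contains U+FFFD, not how many replacement characters are emitted.

```d
import std.algorithm : canFind;
import std.encoding : isValid, sanitize;

unittest
{
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62]; // 'a', an invalid byte, 'b'
    string bad = cast(string) raw;
    assert(!isValid(bad));

    // sanitize returns a string that is valid for its encoding.
    string clean = sanitize(bad);
    assert(isValid(clean));
    assert(clean.canFind('\uFFFD'));
}
```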

> - Jonathan M Davis

-- 
Marco

February 08, 2014
On 2/7/14, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> However, I would argue that assuming that everyone is going to validate their strings and that pretty much all string-related functions shouldn't ever have to worry about invalid Unicode is just begging for subtle bugs all over the place IMHO.

I suggested we would introduce an overload, not replace the existing function, so this isn't an issue.

> The problem is that you need to check it. This is _slower_ than exceptions in the normal case, as invalid Unicode should be the rare case.

Do you have any benchmarks for this? I have a vague memory of complaining that the exception code is *de-facto* slower, regardless of input. But I'll try to provide some test cases later and see where we're at.
February 08, 2014
On Saturday, February 08, 2014 09:20:15 Andrej Mitrovic wrote:
> On 2/7/14, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > However, I would argue that assuming that everyone is going to validate their strings and that pretty much all string-related functions shouldn't ever have to worry about invalid Unicode is just begging for subtle bugs all over the place IMHO.
> 
> I suggested we would introduce an overload, not replace the existing function, so this isn't an issue.
> 
> > The problem is that you need to check it. This is _slower_ than exceptions in the normal case, as invalid Unicode should be the rare case.
> 
> Do you have any benchmarks for this? I have a vague memory of complaining that the exception code is *de-facto* slower, regardless of input. But I'll try to provide some test cases later and see where we're at.

The exception version has to do all of the same checks that the version which returns an error value would have to do, while the one returning an error value which had to be checked for validity would have an extra check. So, the only ways that the exception version would be slower are if the plumbing for being able to throw an exception from the function makes it slower (assuming that the other would be nothrow) or if the optimizer just does worse with the exception one for some reason, because the number of operations that the actual D code would be doing in the successful case would be greater for the non-throwing version. Code generation can do entertaining things to efficiency though, so benchmarking would be required to see what would actually happen.
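
To make the "extra check" concrete, here is a sketch of the two calling shapes. tryDecode, countThrowing and countChecked are hypothetical names invented for this illustration. tryDecode is implemented by catching the exception purely to show its signature, so the sketch is not suitable for timing anything; it only shows where the additional branch lands in the caller.

```d
import std.typecons : Nullable;
import std.utf : decode, UTFException;

/// Hypothetical error-returning variant, for illustration only.
Nullable!dchar tryDecode(string str, ref size_t index)
{
    immutable start = index;
    try
    {
        return Nullable!dchar(decode(str, index));
    }
    catch (UTFException)
    {
        index = start + 1; // assumed policy: skip one code unit
        return Nullable!dchar.init;
    }
}

/// Throwing style: no per-character check in the caller.
size_t countThrowing(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        decode(s, i);
        ++n;
    }
    return n;
}

/// Error-value style: one more branch per code point, even for valid input.
size_t countChecked(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        auto c = tryDecode(s, i);
        if (!c.isNull)
            ++n;
    }
    return n;
}

unittest
{
    assert(countThrowing("h\u00E9llo") == 5);
    assert(countChecked("h\u00E9llo") == 5);
}
```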

However, as I stated in another post, I've reconsidered the situation. I think that I misunderstood what Dmitry was suggesting and that checking the error value is not actually necessary:

http://forum.dlang.org/post/mailman.66.1391838333.21734.digitalmars-d@puremagic.com

And if that's the case, then we can probably move towards having decode not throw and possibly getting rid of UTFException altogether (certainly, most code wouldn't throw it or have to worry about it, since decode and stride are the two main cases where that's a concern, and if they don't throw anymore, then UTFException would have very little use).

- Jonathan M Davis
February 08, 2014
On Saturday, 8 February 2014 at 00:49:46 UTC, Andrei Alexandrescu wrote:
> One simple idea is to statically allocate the same exception and rethrow it over and over. After all there's no guarantee a distinct exception is thrown every time, and the approach is still memory safe (though it might surprise the programmer who saves a reference to an old exception).
>
> Andrei

I don't think it's that simple. What happens if an XException causes another XException and they need to be chained together?
February 08, 2014
08-Feb-2014 15:02, Jakob Ovrum wrote:
> On Saturday, 8 February 2014 at 00:49:46 UTC, Andrei Alexandrescu wrote:
>> One simple idea is to statically allocate the same exception and
>> rethrow it over and over. After all there's no guarantee a distinct
>> exception is thrown every time, and the approach is still memory safe
>> (though it might surprise the programmer who saves a reference to an
>> old exception).
>>
>> Andrei
>
> I don't think it's that simple. What happens if an XException causes
> another XException and they need to be chained together?

If both are thread-local and cached I see no problem whatsoever.
The thing is, the current "default" of creating exceptions is AWFUL.
And D stands for sane defaults and the simple path being good last time I checked.

-- 
Dmitry Olshansky
February 08, 2014
On Saturday, 8 February 2014 at 11:05:38 UTC, Dmitry Olshansky wrote:
> If both are thread-local and cached I see no problem whatsoever.
> The thing is, the current "default" of creating exceptions is AWFUL.
> And D stands for sane defaults and the simple path being good last time I checked.

How is it not a problem? XException's fields (message, location etc) would be overwritten by the latest throw site, and its `next` field would point to itself.
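
A minimal sketch of the overwriting half of that concern, assuming a thread-local cached instance. XException, cached and getCached are hypothetical names for illustration. The self-referencing `next` case is not reproduced here, since it depends on how the runtime chains in-flight exceptions.

```d
/// Hypothetical exception type standing in for "XException".
class XException : Exception
{
    this(string msg) { super(msg); }
}

/// Module-level variables are thread-local by default in D.
XException cached;

/// Hypothetical helper: reuses one instance and overwrites its fields
/// with the details of the latest throw site.
XException getCached(string msg, string file = __FILE__, size_t line = __LINE__)
{
    if (cached is null)
        cached = new XException("");
    cached.msg = msg;
    cached.file = file;
    cached.line = line;
    return cached;
}

unittest
{
    Exception saved;

    try
    {
        throw getCached("first");
    }
    catch (XException e)
    {
        saved = e; // the caller keeps a reference to the "old" exception
    }
    assert(saved.msg == "first");

    try
    {
        throw getCached("second");
    }
    catch (XException e)
    {
        assert(e is saved); // same object as before
    }

    // The saved reference now reports the later throw site.
    assert(saved.msg == "second");
}
```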