February 08, 2014
08-Feb-2014 02:57, Jonathan M Davis wrote:
> On Friday, February 07, 2014 20:43:38 Dmitry Olshansky wrote:
>> 07-Feb-2014 20:29, Andrej Mitrovic wrote:
>>> On Friday, 7 February 2014 at 16:27:35 UTC, Andrei Alexandrescu wrote:
>>>> Add a bugzilla and let's define isValid that returns bool!
>>>
>>> Add std.utf.decode() to that as well. IOW, it should have an overload
>>> which returns a status code
>>
>> Much simpler - it returns a special dchar to designate bad encoding. And
>> there is one defined by Unicode spec.
>
> Isn't that actually worse?

No, it's better and more flexible for those who care to repair broken text. We currently have ZERO facilities for working with partly broken UTF, and it's not that rare to run into it.

> Unless you're suggesting that we stop throwing on
> decode errors,

That is exactly what I suggest.

> then functions like std.array.front will have to check the
> result on every call to see whether it was valid or not and thus whether they
> should throw, which would mean extra overhead over simply having decode throw
> on decode errors.

Why the heck? It will not throw either. In the end, bad encoding is handled by displaying a substitute character (typically '?') in the places where the text is broken, not by throwing one's hands in the air and spitting out "UTF Exception: offset 4302 bad UTF sequence". This is not good enough (in case somebody thought that it was).

Those who care about throwing can add a trivial map!(x => x != '\uFFFD' || die()) over a string, where die is a function that throws an exception.
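
A minimal sketch of that opt-in check, assuming a decoder that substitutes U+FFFD instead of throwing. Here std.utf.byDchar plays that role; strictDchars is a hypothetical name, and enforce stands in for the die function. Note that a genuine U+FFFD in the input would also trigger the exception.

```d
import std.algorithm.iteration : map;
import std.exception : enforce;
import std.utf : byDchar, replacementDchar;

/// Hypothetical helper: lazily decodes `s` and throws as soon as a
/// replacement character (U+FFFD) comes out of the decoder.
auto strictDchars(const(char)[] s)
{
    return s.byDchar.map!((dchar c) {
        enforce(c != replacementDchar, "invalid UTF-8 sequence");
        return c;
    });
}

void main()
{
    import std.algorithm.comparison : equal;
    import std.array : array;
    import std.exception : assertThrown;

    // Well-formed input passes through unchanged.
    assert(strictDchars("abc").equal("abc"d));

    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    auto broken = cast(const(char)[]) raw;
    assertThrown(strictDchars(broken).array);
}
```

Code that does not care simply iterates byDchar directly and sees U+FFFD where the input was broken.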

> validate has no business throwing, and we definitely should
> add isValidUnicode (or isValid or whatever you want to call it) for validation
> purposes. Code can then call that to validate that a string is valid and not
> worry about any UTFExceptions being thrown as long as it doesn't manipulate
> the string in a way that could result in its Unicode becoming invalid.

Yet later down the road decode will triple-check that anyway. Just saying. BTW, if the string was checked beforehand, there is no difference between the two approaches at all (you don't have to check).

> However, I would argue that assuming that everyone is going to validate their
> strings and that pretty much all string-related functions shouldn't ever have
> to worry about invalid Unicode is just begging for subtle bugs all over the
> place IMHO. You're essentially dealing with error codes at that point, and I
> think that experience has shown quite clearly that error codes are generally a
> bad way to go. Almost no one checks them unless they have to. I think that
> having decode throw on invalid Unicode is exactly what it should be doing. The
> problem is that validate shouldn't.

Every single text editor out there seems to disagree with you: they do show you partially substituted text, not a dialog box "My bad, it's broken UTF-8, I'm giving up!".

-- 
Dmitry Olshansky
February 08, 2014
08-Feb-2014 03:01, Meta wrote:
> On Friday, 7 February 2014 at 22:57:26 UTC, Jonathan M Davis wrote:
>> On Friday, February 07, 2014 20:43:38 Dmitry Olshansky wrote:
> You could always return an Option!char. Nullable won't work because it
> lets you access the naked underlying value.

This is a ridiculously distracting suggestion and simply has no merit whatsoever.

To underline how impractical this suggestion is: currently all code out there expects dchar from .front, not some magic animal called 'Option!char'.

-- 
Dmitry Olshansky
February 08, 2014
On Saturday, February 08, 2014 11:17:25 Jakob Ovrum wrote:
> On Saturday, 8 February 2014 at 11:05:38 UTC, Dmitry Olshansky wrote:
> 
> > If both are thread-local and cached I see no problem whatsoever.
> > The thing is the current "default" of creating exception is
> > AWFUL.
> > And D stands for sane defaults and the simple path being good
> > last time I checked.
> 
> How is it not a problem? XException's fields (message, location etc) would be overwritten by the latest throw site, and its `next` field would point to itself.

Then we have multiple of them, or we new up another one when a second one is needed. Even if it were only the first exception which avoided the allocation, it would be a big gain, and in most cases, you're only going to get a single exception, or the exceptions will be of different types.

- Jonathan M Davis
February 08, 2014
08-Feb-2014 09:45, Jonathan M Davis wrote:
> On Friday, February 07, 2014 21:04:08 Jonathan M Davis wrote:
> Actually, thinking this through some more, if we can replace invalid Unicode
> with 0xFFFD, and have all algorithms work with that and consider it valid
> Unicode (rather than getting weird bugs due to invalid Unicode), then if
> decode returned that on error rather than throwing, we wouldn't actually need
> to check the return value. It wouldn't matter that the Unicode was invalid.
> So, we wouldn't even need to _care_ that the Unicode was invalid. Anyone who
> _did_ care could call isValidUnicode to validate the Unicode first, and those
> who didn't wouldn't need to worry about UTFException being thrown, because
> everything would still work even if the string was invalid Unicode.

Hm.. yes. I gotta read the whole thread next time :)


> So, if that's indeed what 0xFFFD does, and that's what Dmitry meant by
> proposing that we return that rather than throwing, then I rescind my
> assessment that throwing was the best way to go and have to agree that
> returning 0xFFFD would be better. I was responding under the assumption that
> you had to check for 0xFFFD and respond to it in order to avoid having your code
> be buggy, in which case throwing would be far better. But if 0xFFFD is
> considered valid Unicode,

It is.

> then returning that would be a fantastic solution.
> And if that's the case, we only need two functions, not three:
>
> 1. decode, which returns 0xFFFD on decode failure
>
> 2. isValidUnicode, which returns whether the string is valid
>

Yay.
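
The two-function split quoted above can be sketched with what std.utf offers today. This is only an illustration under stated assumptions: decode!(Yes.useReplacementDchar) is used as the non-throwing decode, and isValidUnicode is the hypothetical name from this thread, written here as a thin wrapper over the throwing std.utf.validate.

```d
import std.typecons : Yes;
import std.utf : decode, replacementDchar, validate, UTFException;

/// Hypothetical function (name taken from the thread): reports validity
/// as a bool by wrapping the throwing std.utf.validate.
bool isValidUnicode(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 'a' followed by a lone continuation byte, which is not valid UTF-8.
    immutable(ubyte)[] raw = [0x61, 0x80];
    auto broken = cast(const(char)[]) raw;

    size_t i = 0;
    assert(decode!(Yes.useReplacementDchar)(broken, i) == 'a');
    // The bad byte decodes to U+FFFD instead of raising UTFException.
    assert(decode!(Yes.useReplacementDchar)(broken, i) == replacementDchar);

    // Callers that care validate up front.
    assert(!isValidUnicode(broken));
    assert(isValidUnicode("abc"));
}
```

A real isValidUnicode would presumably avoid exceptions internally; the wrapper only shows the intended interface.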

> And I actually really like the idea that we could just operate on invalid
> Unicode as valid Unicode this way, making it so that most code doesn't need to
> care, and code that _does_ need to care, can validate the strings first. Right
> now, pretty much all string code needs to care in order to avoid processing
> invalid Unicode, which is much messier.
>
Hooray! The good part is that, for example, I can run a regex on partially broken text and get some sane results out of it.

> - Jonathan M Davis
>


-- 
Dmitry Olshansky
February 08, 2014
08-Feb-2014 12:20, Andrej Mitrovic wrote:
> On 2/7/14, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
>> However, I would argue that assuming that everyone is going to validate
>> their strings and that pretty much all string-related functions shouldn't
>> ever have to worry about invalid Unicode is just begging for subtle bugs
>> all over the place IMHO.
>
> I suggested we would introduce an overload, not replace the existing
> function, so this isn't an issue.
>
>> The problem is that you need to check it. This is _slower_ than exceptions in
>> the normal case, as invalid Unicode should be the rare case.
>
> Do you have any benchmarks for this? I have a vague memory of
> complaining that the exception code is *de-facto* slower, regardless
> of input. But I'll try to provide some test-cases later and see where
> we're at.
>

Just be sure to test on LDC or GDC. DMD results are irrelevant to the performance-minded of our community. Also be sure to copy all the code involved into a single file rather than linking against Phobos.

People tend to throw around figures like ~10% slower with exceptions turned on, but you never know what exactly they tested.

-- 
Dmitry Olshansky
February 08, 2014
On Saturday, 8 February 2014 at 11:27:27 UTC, Jonathan M Davis wrote:
> On Saturday, February 08, 2014 11:17:25 Jakob Ovrum wrote:
>> On Saturday, 8 February 2014 at 11:05:38 UTC, Dmitry Olshansky wrote:
>> 
>> > If both are thread-local and cached I see no problem whatsoever.
>> > The thing is the current "default" of creating exception is
>> > AWFUL.
>> > And D stands for sane defaults and the simple path being good
>> > last time I checked.
>> 
>> How is it not a problem? XException's fields (message, location
>> etc) would be overwritten by the latest throw site, and its
>> `next` field would point to itself.
>
> Then we have multiple of them, or we new up another one when a second one is
> needed. Even if it were only the first exception which avoided the allocation,
> it would be a big gain, and in most cases, you're only going to get a single
> exception, or the exceptions will be of different types.
>
> - Jonathan M Davis

Yes, I'm sure there is a cool solution, I'm just pointing out that it's not as simple as statically allocating.

I think it would be a nice exercise to compose such a solution with std.allocator.
February 08, 2014
On 2/8/14, 3:02 AM, Jakob Ovrum wrote:
> On Saturday, 8 February 2014 at 00:49:46 UTC, Andrei Alexandrescu wrote:
>> One simple idea is to statically allocate the same exception and
>> rethrow it over and over. After all there's no guarantee a distinct
>> exception is thrown every time, and the approach is still memory safe
>> (though it might surprise the programmer who saves a reference to an
>> old exception).
>>
>> Andrei
>
> I don't think it's that simple. What happens if an XException causes
> another XException and they need to be chained together?

The chaining method detects that and .dup's one of them.

Andrei

February 08, 2014
On Saturday, 8 February 2014 at 11:24:56 UTC, Dmitry Olshansky wrote:
> 08-Feb-2014 03:01, Meta wrote:
>> On Friday, 7 February 2014 at 22:57:26 UTC, Jonathan M Davis wrote:
>>> On Friday, February 07, 2014 20:43:38 Dmitry Olshansky wrote:
>> You could always return an Option!char. Nullable won't work because it
>> lets you access the naked underlying value.
>
> This is a ridiculously distracting suggestion and simply has no merit whatsoever.
>
> To underline how impractical this suggestion is: currently all code out there expects dchar from .front, not some magic animal called 'Option!char'.

I'm not actually suggesting a replacement. Just wishful thinking on how the function could've been better designed.
February 08, 2014
On Saturday, 8 February 2014 at 16:50:53 UTC, Andrei Alexandrescu wrote:
> The chaining method detects that and .dup's one of them.
>
> Andrei

After some thinking, I don't think it actually helps: the exception will be modified _before_ throwing in library code, so cloning will be too late.

But I don't see any reason why basic exception instances in Phobos can't be made immutable.
February 08, 2014
On Saturday, 8 February 2014 at 16:50:53 UTC, Andrei Alexandrescu wrote:
> On 2/8/14, 3:02 AM, Jakob Ovrum wrote:
>> On Saturday, 8 February 2014 at 00:49:46 UTC, Andrei Alexandrescu wrote:
>>> One simple idea is to statically allocate the same exception and
>>> rethrow it over and over. After all there's no guarantee a distinct
>>> exception is thrown every time, and the approach is still memory safe
>>> (though it might surprise the programmer who saves a reference to an
>>> old exception).
>>>
>>> Andrei
>>
>> I don't think it's that simple. What happens if an XException causes
>> another XException and they need to be chained together?
>
> The chaining method detects that and .dup's one of them.
>
> Andrei

What if the statically allocated XException is escaped to be inspected later, but before that is thrown again in a separate exception chain?

I suppose it would be no different from the current situation, as it's legal to throw exceptions allocated in any fashion, so there is already no guarantee of uniqueness. It's probable that some code out there still takes exception uniqueness for granted, so changing the allocation scheme would be a (typically silent) breaking change, even if the code is arguably broken in the first place. I suppose we could make that breakage a compile error by making exceptions implicitly `scope` at the catch-site, but that would of course be a much more involved change...
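
The uniqueness concern can be shown with a small sketch. Everything here is hypothetical: XException is a stand-in type, and reuse is an assumed helper that keeps one thread-local instance and overwrites its fields on every throw, which is only one possible way to implement the reuse idea.

```d
/// Hypothetical exception type standing in for "XException".
class XException : Exception
{
    this(string msg, string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
    }
}

// Module-level variables are thread-local by default in D.
private XException cached;

/// Hypothetical helper: hands out the same instance on every call,
/// overwriting the fields set by the previous throw site.
XException reuse(string msg, string file = __FILE__, size_t line = __LINE__)
{
    if (cached is null)
        cached = new XException(msg, file, line);
    else
    {
        cached.msg = msg;
        cached.file = file;
        cached.line = line;
    }
    return cached;
}

void main()
{
    Exception saved;
    try
    {
        throw reuse("first");
    }
    catch (XException e)
    {
        saved = e; // escaped to be inspected later
    }
    assert(saved.msg == "first");

    try
    {
        throw reuse("second");
    }
    catch (XException e)
    {
        assert(e is saved); // same object as before
    }
    // The saved reference now silently describes the second throw.
    assert(saved.msg == "second");
}
```

Code that stored the first exception for later inspection sees its contents change without any compile-time or run-time warning.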

Personally I still like the idea, but if implemented, I think something should be done about the change in uniqueness at the same time, even if it's just an added note in the language documentation on exceptions.