Unicode handling comparison (page 5)

November 28, 2013

Re: Unicode handling comparison

Posted by H. S. Teoh
in reply to Walter Bright

On Thu, Nov 28, 2013 at 09:52:08AM -0800, Walter Bright wrote:
> On 11/28/2013 5:24 AM, monarch_dodra wrote:
> >Which operations are you thinking of in std.array that decode when they shouldn't?
> 
> front() in std.array looks like:
> 
> @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
> {
>     assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
>     size_t i = 0;
>     return decode(a, i);
> }
> 
> So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.

OTOH, it is actually correct by default. If it *didn't* decode, things like std.algorithm.sort and std.range.retro would mangle all your multibyte UTF-8 characters.
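
To make the mangling concrete, here is a minimal sketch (not from the original post) that reverses a string once by decoded code point and once by raw code unit. The helper names reverseDecoded, reverseCodeUnits and isValidUtf8 are made up for this illustration; the Phobos calls are used as I understand them in a current release.

```d
import std.algorithm : reverse;
import std.array : array;
import std.conv : to;
import std.range : retro;
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: reverse by decoded code point, which is what
// retro does on a string when .front/.back decode.
string reverseDecoded(string s)
{
    return s.retro.array.to!string;
}

// Hypothetical helper: reverse the raw UTF-8 code units, which is what
// a non-decoding range would do.
ubyte[] reverseCodeUnits(string s)
{
    auto bytes = s.representation.dup;
    reverse(bytes);
    return bytes;
}

// Hypothetical helper: true if the bytes form valid UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Usage; run with: rdmd -main -unittest file.d
unittest
{
    // "a" followed by U+00E9, which is the two bytes C3 A9 in UTF-8.
    assert(reverseDecoded("a\u00E9") == "\u00E9a");
    // The byte-wise reversal starts with a continuation byte.
    assert(!isValidUtf8(reverseCodeUnits("a\u00E9")));
}
```

The decoded reversal stays valid UTF-8, while the code unit reversal splits the two-byte sequence and no longer validates.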

Having said that, though, it would be nice if there were a standard ASCII string type that didn't decode by default. Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters. Maybe we want something like this:

	struct AsciiString {
		immutable(ubyte)[] impl;
		alias impl this;

		// This is so that .front returns char instead of ubyte
		@property char front() { return cast(char) impl[0]; }

		char opIndex(size_t idx) { return cast(char) impl[idx]; } // ditto

		... // other range methods here
	}

	AsciiString assumeAscii(string s)
	{
		return AsciiString(cast(immutable(ubyte)[]) s);
	}
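
For reference, one way the sketch above could be filled in so that it compiles as an input range of char. This is only an assumption about what the elided "other range methods" might look like: it drops the alias this, spells out empty, popFront and length explicitly, and uses std.string.representation in place of the cast.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType, isInputRange;
import std.string : representation;

// Assumed completion of the sketch; not the author's final design.
struct AsciiString
{
    immutable(ubyte)[] impl;

    @property bool empty() const { return impl.length == 0; }

    // Returns char instead of ubyte, with no decoding step.
    @property char front() const { return cast(char) impl[0]; }

    void popFront() { impl = impl[1 .. $]; }

    @property size_t length() const { return impl.length; }

    char opIndex(size_t idx) const { return cast(char) impl[idx]; }
}

// The caller promises that s contains only ASCII; nothing is checked here.
AsciiString assumeAscii(string s)
{
    return AsciiString(s.representation);
}

// The element type is char, whereas a plain string yields dchar.
static assert(isInputRange!AsciiString);
static assert(is(ElementType!AsciiString == char));
static assert(is(ElementType!string == dchar));

// Usage; run with: rdmd -main -unittest file.d
unittest
{
    auto a = assumeAscii("abc");
    assert(a.front == 'a');
    assert(a[2] == 'c');
    assert(equal(a, "abc"));
}
```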


T

-- 
"640K ought to be enough" -- Bill G., 1984.
"The Internet is not a primary goal for PC usage" -- Bill G., 1995.
"Linux has no impact on Microsoft's strategy" -- Bill G., 1999.

On Thursday, 28 November 2013 at 18:55:44 UTC, Dicebot wrote:
> http://dlang.org/phobos/std_encoding.html#.AsciiString ?

Yeah, that or just ubyte[]. The problem with both of these though, is printing :/ (which prints ugly as sin)

Something like:

	struct AsciiChar
	{
		private char c;
		alias c this;
	}

Could be a very easy and efficient alternative.

28-Nov-2013 17:24, monarch_dodra wrote:
> On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote:
>> Sadly,
>
> I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into account, but still happen to work with them.

The greatest problem is, surprisingly, that you can't apply range functions to the implicit code unit range even if you REALLY wanted to. To not go far away: the only reason std.regex can't take e.g. retro of a string:

	match(retro("hleb"), ".el.");

is because of the automatic dumbing down at the moment you apply a range adapter. What I'd need in std.regex is a code unit range that by convention also "happens to be" a range of code points.

The second problem is that string code is carefully special-cased, but the effort is completely wasted the moment you have a slice of chars that comes from anywhere other than built-in strings (a circular buffer, for instance).

I had a (somewhat cloudy) vision of settling the encoded ranges problem once and for all. That includes defining the notion of an encoded range that is two in one: some stronger (as in capabilities) range of code elements and the default decoded view imposed on top of it (that can be weaker).

-- 
Dmitry Olshansky
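
The "dumbing down" can be checked at compile time. The sketch below (not from the thread) compares retro applied to a string with retro applied to its code units via std.string.representation; the alias names and the reverseAsciiBytes helper are made up for this illustration.

```d
import std.range : retro;
import std.range.primitives : ElementType, isBidirectionalRange,
    isRandomAccessRange;
import std.string : representation;

// Illustrative aliases for the two adapter types.
alias DecodedRetro = typeof(retro("hleb"));
alias CodeUnitRetro = typeof(retro("hleb".representation));

// retro over a string is a bidirectional range of dchar only.
static assert(isBidirectionalRange!DecodedRetro);
static assert(!isRandomAccessRange!DecodedRetro);
static assert(is(ElementType!DecodedRetro == dchar));

// retro over the code units keeps random access.
static assert(isRandomAccessRange!CodeUnitRetro);

// Hypothetical helper: reverse an ASCII-only string through the code unit view.
string reverseAsciiBytes(string s)
{
    char[] result;
    foreach (b; s.representation.retro)
        result ~= cast(char) b;
    return result.idup;
}

// Usage; run with: rdmd -main -unittest file.d
unittest
{
    assert(reverseAsciiBytes("hleb") == "belh");
    assert(retro("hleb".representation)[0] == 'b');
}
```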

On 11/28/2013 10:19 AM, H. S. Teoh wrote:
> Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters.

It doesn't have to be merely ASCII. You can do string substring searches without any need for decoding, for example. You don't even need decoding to do regex. Decoding is rarely needed.
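
A minimal sketch (not from the thread) of a substring search that never decodes. It works because UTF-8 is self-synchronizing: a match of a valid needle inside a valid haystack always starts on a code point boundary. The function name indexOfBytes is made up for this illustration.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;

// Hypothetical helper: byte offset of needle in haystack, or -1 if absent.
// Only raw UTF-8 code units are compared.
ptrdiff_t indexOfBytes(string haystack, string needle)
{
    return haystack.representation.countUntil(needle.representation);
}

// Usage; run with: rdmd -main -unittest file.d
unittest
{
    // U+00EF and U+00E9 each take two bytes in UTF-8.
    string text = "na\u00EFve caf\u00E9";
    auto pos = indexOfBytes(text, "caf\u00E9");
    assert(pos == 7);                    // offset in code units, not code points
    assert(text[pos .. $] == "caf\u00E9");
    assert(indexOfBytes(text, "xyz") == -1);
}
```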

On 11/28/2013 11:32 AM, Dmitry Olshansky wrote:
> I had a (a bit cloudy) vision of settling encoded ranges problem once and for good. That includes defining notion of an encoded range that is 2 in one: some stronger (as in capabilities) range of code elements and the default decoded view imposed on top of it (that can be weaker).

I suspect the correct approach would be to have the range over a string produce bytes. If you want decoded values, then run it through an adapter algorithm.
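
A sketch of what that layering can look like, assuming std.utf.byCodeUnit and std.utf.byDchar from current Phobos (neither is mentioned in this thread): the base range yields code units, and decoding is an explicit adapter on top. The two counting functions are made-up names for illustration.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: iterate the string as code units, no decoding.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

// Hypothetical helper: opt in to decoding through an explicit adapter.
size_t countCodePoints(string s)
{
    return s.byCodeUnit.byDchar.walkLength;
}

// Usage; run with: rdmd -main -unittest file.d
unittest
{
    // "caf" plus U+00E9, which is two bytes in UTF-8.
    assert(countCodeUnits("caf\u00E9") == 5);
    assert(countCodePoints("caf\u00E9") == 4);
}
```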

On 11/27/2013 12:06 PM, Dmitry Olshansky wrote:
> 27-Nov-2013 18:45, David Nadlinger wrote:
>> As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.
>
> It's anything but cheap. At the minimum imagine crawling the string and issuing a table lookup per codepoint.

Decoding isn't cheap, either, which is why I rant about it being the default behavior.

On Wednesday, 27 November 2013 at 20:13:32 UTC, Dmitry Olshansky wrote:
> I could have sworn we had byGrapheme somewhere, well apparently not :(

Simple attempt: https://github.com/D-Programming-Language/phobos/pull/1736
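
For readers on a current Phobos, std.uni does provide a byGrapheme. The sketch below uses that function and makes no assumption about whether it matches the design in the pull request above; countGraphemes is a made-up helper name.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of grapheme clusters in s.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

// Usage; run with: rdmd -main -unittest file.d
unittest
{
    // "e" followed by U+0301 (combining acute accent).
    string s = "e\u0301";
    assert(s.length == 3);            // UTF-8 code units
    assert(s.walkLength == 2);        // code points
    assert(countGraphemes(s) == 1);   // one user-perceived character
}
```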
