November 28, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Thu, Nov 28, 2013 at 09:52:08AM -0800, Walter Bright wrote:
> On 11/28/2013 5:24 AM, monarch_dodra wrote:
> >Which operations are you thinking of in std.array that decode when they shouldn't?
>
> front() in std.array looks like:
>
> @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
> {
>     assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
>     size_t i = 0;
>     return decode(a, i);
> }
>
> So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.
OTOH, it is actually correct by default. If it *didn't* decode, things like std.algorithm.sort and std.range.retro would mangle all your multibyte UTF-8 characters.
Having said that, though, it would be nice if there were a standard ASCII string type that didn't decode by default. Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters. Maybe we want something like this:
struct AsciiString {
    immutable(ubyte)[] impl;
    alias impl this;
    // This is so that .front returns char instead of ubyte
    @property char front() { return cast(char) impl[0]; }
    char opIndex(size_t idx) { ... /* ditto */ }
    ... // other range methods here
}
AsciiString assumeAscii(string s) {
    return AsciiString(cast(immutable(ubyte)[]) s);
}
T
--
"640K ought to be enough" -- Bill G., 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999. |
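To make the "mangling" point concrete, here is a minimal sketch that reverses a string once through the decoding range interface and once as raw bytes. The helper names `reverseDecoded`, `reverseBytes`, and `isValidUtf8` are made up for this illustration and are not part of Phobos; the Phobos calls used (`retro`, `representation`, `reverse`, `toUTF8`, `validate`) are assumed to behave as in current releases.

```d
import std.algorithm.mutation : reverse;
import std.array : array;
import std.range : retro;
import std.string : representation;
import std.utf : toUTF8, validate, UTFException;

/// Hypothetical helper: reverses by code point. retro on a string goes
/// through the decoding back/popBack, so the elements are dchar.
string reverseDecoded(string s)
{
    return toUTF8(s.retro.array); // array of a dchar range is dchar[]
}

/// Hypothetical helper: reverses the raw UTF-8 bytes without decoding.
ubyte[] reverseBytes(string s)
{
    auto bytes = s.representation.dup; // mutable copy of the code units
    reverse(bytes);
    return bytes;
}

/// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Decoded reversal keeps the two-byte character U+00E9 intact.
    assert(reverseDecoded("h\u00E9llo") == "oll\u00E9h");
    // Byte reversal swaps its lead and continuation bytes: invalid UTF-8.
    assert(!isValidUtf8(reverseBytes("h\u00E9llo")));
    // Pure ASCII input survives byte reversal unchanged in validity.
    assert(isValidUtf8(reverseBytes("hello")));
}
```

The ASCII case is the one the proposed AsciiString targets: when every character is a single code unit, both reversals agree and the decoding step is pure overhead.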
November 28, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | http://dlang.org/phobos/std_encoding.html#.AsciiString ? |
November 28, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dicebot | On Thursday, 28 November 2013 at 18:55:44 UTC, Dicebot wrote:
> http://dlang.org/phobos/std_encoding.html#.AsciiString ?
Yeah, that or just ubyte[].
The problem with both of these though, is printing :/ (which
prints ugly as sin)
Something like:
struct AsciiChar
{
private char c;
alias c this;
}
Could be a very easy and efficient alternative.
|
November 28, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | 28-Nov-2013 17:24, monarch_dodra wrote:
> On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote:
>> Sadly,
>
> I think it's great. It means by default, your strings will always
> be handled correctly. I think there's quite a few algorithms that
> were written without ever taking strings into account, but still
> happen to work with them.
>
The greatest problem, surprisingly, is that you can't apply range functions to the implicit code unit range even if you REALLY wanted to.
To take a nearby example, the only reason std.regex can't take e.g. retro of a string:
match(retro("hleb"), ".el.");
is the automatic dumbing down at the moment you apply a range adapter. What I'd need in std.regex is a code unit range that by convention also "happens to be" a range of code points.
The second problem is that string code is carefully special-cased, but the effort is completely wasted the moment you have a slice of chars that comes from anywhere other than built-in strings (a circular buffer, for instance).
I had a (a bit cloudy) vision of settling encoded ranges problem once and for good. That includes defining notion of an encoded range that is 2 in one: some stronger (as in capabilities) range of code elements and the default decoded view imposed on top of it (that can be weaker).
--
Dmitry Olshansky |
November 29, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On 11/28/2013 10:19 AM, H. S. Teoh wrote:
> Always decoding strings
> *is* slow, esp. when you already know that it only contains ASCII
> characters.
It doesn't have to be merely ASCII. You can do substring searches without any need for decoding, for example. You don't even need decoding to do regex. Decoding is rarely needed.
|
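As a rough illustration of a substring search that never decodes, the sketch below compares raw UTF-8 code units. It relies on the fact that UTF-8 is self-synchronizing, so a well-formed needle cannot match starting in the middle of a multibyte character. The function name `findCodeUnits` is hypothetical; `std.string.representation` and `std.algorithm.searching.find` are assumed to behave as in current Phobos.

```d
import std.algorithm.searching : find;
import std.string : representation;

/// Hypothetical helper: byte offset of `needle` in `haystack`, or -1 if
/// absent. Only raw code units are compared; nothing is decoded.
ptrdiff_t findCodeUnits(string haystack, string needle)
{
    auto hay = haystack.representation; // immutable(ubyte)[] view, no copy
    auto rest = hay.find(needle.representation);
    // find returns an empty range when there is no match.
    if (rest.length == 0 && needle.length != 0)
        return -1;
    return cast(ptrdiff_t)(hay.length - rest.length);
}

void main()
{
    // "na\u00EFve " occupies 7 bytes because U+00EF takes two code units.
    assert(findCodeUnits("na\u00EFve caf\u00E9", "caf\u00E9") == 7);
    assert(findCodeUnits("na\u00EFve caf\u00E9", "tea") == -1);
}
```

Note that the result is a code unit offset, which is also what slicing a D string expects.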
November 29, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On 11/28/2013 11:32 AM, Dmitry Olshansky wrote:
> I had a (a bit cloudy) vision of settling encoded ranges problem once and for
> good. That includes defining notion of an encoded range that is 2 in one: some
> stronger (as in capabilities) range of code elements and the default decoded
> view imposed on top of it (that can be weaker).
I suspect the correct approach would be to have the range over a string produce bytes. If you want decoded values, then run it through an adapter algorithm.
|
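A sketch of what "code units by default, decoding as an opt-in adapter" can look like, assuming a Phobos version that provides `std.utf.byCodeUnit` and `std.utf.byDchar` (both postdate this thread). The two counting functions are hypothetical names used only for the illustration.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: walks the string as a range of char, no decoding.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

/// Hypothetical helper: the same code unit range with an explicit
/// decoding adapter layered on top of it.
size_t countCodePoints(string s)
{
    return s.byCodeUnit.byDchar.walkLength;
}

void main()
{
    // U+00E9 is two UTF-8 code units but a single code point.
    assert(countCodeUnits("h\u00E9llo") == 6);
    assert(countCodePoints("h\u00E9llo") == 5);
}
```

With this layering, a generic algorithm pays for decoding only where the caller asks for it.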
November 29, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On 11/27/2013 12:06 PM, Dmitry Olshansky wrote:
> 27-Nov-2013 18:45, David Nadlinger wrote:
>> As far as I'm aware, this behavior is the result of a deliberate
>> decision, as normalizing strings on the fly isn't really cheap.
>
> It's anything but cheap.
> At the minimum imagine crawling the string and issuing a table lookup per
> codepoint.
Decoding isn't cheap, either, which is why I rant about it being the default behavior.
|
November 29, 2013 Re: Unicode handling comparison | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On Wednesday, 27 November 2013 at 20:13:32 UTC, Dmitry Olshansky wrote:
> I could have sworn we had byGrapheme somewhere, well apparently not :(
Simple attempt:
https://github.com/D-Programming-Language/phobos/pull/1736 |
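For readers on a current toolchain, a minimal sketch of grapheme iteration, assuming `std.uni.byGrapheme` as it exists in present-day Phobos (which may differ from the pull request linked above). The helper names are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts user-perceived characters.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: counts code points via the decoding string range.
size_t countCodePoints(string s)
{
    return s.walkLength;
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent):
    // two code points, one grapheme.
    assert(countCodePoints("e\u0301") == 2);
    assert(countGraphemes("e\u0301") == 1);
}
```

This is the case that code point decoding alone does not handle: the accent is a separate code point, so only grapheme segmentation treats the pair as one character.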