January 11, 2011
Andrei Alexandrescu wrote:

> I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit.
> 
> Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access.
> 
> The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length.

For some compression schemes, implementing the *back primitives is troublesome, if not impossible...

> Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.

It's worth mentioning explicitly -- a VLERange is dually typed. That matters for searching: statically check whether the original and encoded element types match, and if so, perform a fast search directly on the encoded elements. I think an important feature of a VLERange should be the ability to drop itself down to an encoded-typed range, so that front and back return raw data.

Dual typing will also affect foreach -- in the general case you'd want to choose whether to decode or not by typing the element.
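
For instance, with today's built-in strings the element type you give foreach already selects the level of decoding -- certain D behaviour, which a VLERange would merely generalize:

    import std.stdio : writeln;

    void main()
    {
        string s = "żółw"; // Polish; multi-byte code points in UTF-8
        size_t units, points;
        foreach (char c; s)  ++units;   // raw code units, no decoding
        foreach (dchar c; s) ++points;  // decoded code points
        writeln(units, " units, ", points, " points"); // 7 units, 4 points
    }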

I can't stop thinking that VLERange is a two-piece bikini making a bare random-access range safe to look at, one you can take off when the partners have confidence -- not a limited random-access probing facility to span the void between front and back.

> In addition to these (and connecting the two), a VLERange would offer two additional primitives:
> 
> 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.
> 
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.
> 
> In both cases, offset is assumed to be at the beginning of a logical element of the range.

So when I move the spinner on an iPod, I get catapulted into position via the raw-data opIndex, and from there I work my way to the next frame boundary to start playback. Sounds promising.
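
The same resynchronisation trick is easy to sketch for UTF-8 today: after landing at an arbitrary raw offset, skip continuation bytes (those of the form 10xxxxxx) until you reach the next element boundary. A minimal sketch (nextBoundary is a made-up helper, not anything in Phobos):

    /// Advance i to the start of the nearest code point at or after i.
    size_t nextBoundary(const(char)[] s, size_t i)
    {
        while (i < s.length && (s[i] & 0xC0) == 0x80)
            ++i; // skip UTF-8 continuation bytes
        return i;
    }

    unittest
    {
        auto s = "żółw";
        assert(nextBoundary(s, 1) == 2); // offset 1 lands mid-"ż"
    }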

> I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)

I agree, acknowledging encoding/compression as a general direction will bring substantial benefits.

> Writing to such ranges is not addressed by this design. Ideas are welcome.

Yeah, we can address outputting later, that's fair.

> Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?

Well, the only way to find out is to try it. My advice: VLERange originated as a solution to the string problem, so start with a non-string incarnation. Having at least two plugs that fit the same socket (one, we know, is string) will spur confidence in the abstraction.

-- 
Tomek

January 11, 2011
On 1/11/11 11:13 AM, Michel Fortin wrote:
> On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/11/11 4:41 AM, Michel Fortin wrote:
>>> For instance, say we have a conversion range taking a Unicode string and
>>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>>> "oe" (one chararacter to two characters), in this case 'front' could
>>> simply return "oe" (two characters) in one iteration, with stepSize
>>> being the size of the "œ" code point. In the same conversion process,
>>> encountering "e" followed by a combining "´" would return pre-combined
>>> character "é" (two characters to one character).
>>
>> In the design as I thought of it, the effective length of one logical
>> element is one or more representation units. My understanding is that
>> you are referring to a fractional number of representation units for
>> one logical element.
>
> Your understanding is correct.
>
> I think both cases (one becomes many & many becomes one) are important
> and must be supported. Your proposal only deals with the many-becomes-one
> case.

I disagree. When I suggested this design I was worried about over-abstracting. Now this looks like abstracting for stuff that hasn't even been addressed concretely yet.

Besides, using the bit as an encoding unit sounds like an acceptable approach for anything fractional.

> I proposed returning arrays so we can deal with the one-becomes-many
> case ("œ" becoming "oe"). Another idea would be to introduce "substeps".
> When checking for the next character, in addition to determining its
> step length you could also determine the number of substeps in it. "œ"
> would have two substeps, "o" and "e", and when there is no longer any
> substep you move to the next step.
>
> All this said, I think this should stay an implementation detail as this
> would allow a variety of strategies. Also, keeping this an
> implementation detail means that your proposed 'stepSize' and
> 'backstepSize' need to be an implementation detail too (because they
> won't make sense for the one-to-many case). So they can't really be part
> of a standard VLE interface.

If you don't have at least stepSize that tells you how large the stride is to get to the next element, it becomes impossible to move within the range using integral indexes.
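
To illustrate, here's how an algorithm would walk such a range by integral index -- a sketch against the proposed interface (stepSize is the proposal's primitive, not anything in Phobos):

    // R is assumed to satisfy the proposed VLERange interface.
    void process(R)(R r)
    {
        for (size_t i = 0; i < r.length; i += r.stepSize(i))
        {
            auto unit = r[i]; // encoding unit starting a logical element
            // ... decode and handle the element beginning at offset i ...
        }
    }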

> As far as I know, all we really need to expose to algorithms is whether
> a range has elements of variable length, because this has an impact on
> your indexing capabilities. The rest seems unnecessary to me, or am I
> missing some use cases?

I think you could say that you don't really need stepSize because you can compute it as follows:

auto r1 = r;      // copy the range
r1.popFront();    // pop exactly one logical element
size_t stepSize = r.length - r1.length; // units that element occupied

This is tenuous, inefficient, and impossible if the support range doesn't support length (I realize that variable-length encodings work over other ranges than random access, but then again this may be an overgeneralization).


Andrei
January 11, 2011
On 1/11/11 11:21 AM, Steven Schveighoffer wrote:
> On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
>>> While this makes it possible to write algorithms that only accept
>>> VLERanges, I don't think it solves the major problem with strings --
>>> they are treated as arrays by the compiler.
>>
>> Except when they're not - foreach with dchar...
>
> This solitary difference is a very thin argument -- foreach(d;
> byDchar(str)) would be just as good without requiring compiler help.
>
>>
>>> I'd also rather see an indexing operation return the element type, and
>>> have a separate function to get the encoding unit. This makes more sense
>>> for generic code IMO.
>>
>> But that's neither here nor there. That would return the logical
>> element at a physical position. I am very doubtful that much generic
>> code could work without knowing it is in fact dealing with a
>> variable-length encoding.
>
> It depends on the function, and the way the indexing is implemented.
>
>>> I noticed you never commented on my proposed string type...
>>>
>>> That reminds me, I should update with suggested changes and re-post it.
>>
>> To be frank, I think it didn't mark a visible improvement. It solved
>> some problems and brought others. There was disagreement over the
>> offered primitives and their semantics.
>
> It is supposed to be simple, and provide the expected interface, without
> causing any undue performance degradation. That is, I should be able to
> do all the things with a replacement string type that I can with a char
> array today, as efficiently as I can today, except I should have to work
> to get at the code-units. The huge benefit is that I can say "I'm
> dealing with this as an array" when I know it's safe

Unfinished sentence? Anyway, for my money you just described what we have now.

> The disagreement will never be fully solved, as there is just as much
> disagreement about the current state of affairs ;) e.g. should foreach
> default to using dchar?

I disagree about the disagreement being unsolvable. I'm not rigid; if I saw a terrific abstraction in your string, I'd be all for it. It just shuffles some issues around, and although I agree it does one or two things better than char[], at the end of the day it doesn't carry its weight.

>> That being said, it's good you are doing this work. In the best case,
>> you could bring a compelling abstraction to the table. In the worst,
>> you'll become as happy about D's strings as I am :o).
>
> I don't think I'll ever be 'happy' with the way strings sit in phobos
> currently. I typically deal in ASCII (i.e. code units), and phobos works
> very hard to prevent that.

I wonder if we could and should extend some of the functions in std.string to work with ubyte[]. I did add a function called representation() that I haven't documented yet. Essentially, representation() gives you the ubyte[], ushort[], or uint[] underneath a string, with the same qualifiers. Whenever you want an algorithm to work on ASCII in earnest, you can pass representation(s) to it instead of s.

If you work a lot with ASCII, an AsciiString abstraction may be a better string type, and one more likely to succeed. Better yet, you could simply focus on AsciiChar and then define ASCII strings as arrays of AsciiChar.
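
A minimal sketch of what I mean (AsciiChar is hypothetical; nothing like it exists in Phobos):

    /// A character type that statically guarantees the ASCII subset.
    struct AsciiChar
    {
        ubyte value;
        this(char c) { assert(c < 0x80, "not an ASCII character"); value = c; }
        alias value this; // usable wherever a ubyte is expected
    }

    /// Fixed-width encoding, hence a true random-access string.
    alias AsciiString = immutable(AsciiChar)[];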


Andrei
January 12, 2011
On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
>> The main (and massively ignored) issue when manipulating unicode text is
>> rather that, unlike with legacy character sets, one codepoint does *not*
>> represent a character in the common sense. In character sets like
>> latin-1:
>> * each code represents a character, in the common sense (eg "à")
>> * each character representation has the same size (1 or 2 bytes)
>> * each character has a single representation ("à" --> always 0xe0)
>> All of this is wrong with unicode. And these are complicated and
>> high-level issues, that appear _after_ decoding, on codepoint sequences.
>>
>> If VLERange is helpful in dealing with those problems, then I don't
>> understand your presentation, sorry. Do you for instance mean such a
>> range would, under the hood, group together codes belonging to the same
>> character (thus making indexing meaningful), and/or normalise (decomp &
>> order) (thus allowing one to compare/find/count correctly)?
>
> VLERange would offer automatic decoding in front, back, popFront, and
> popBack - just like BidirectionalRange does right now. It would also
> offer access to the representational support by means of indexing - also
> like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstract away the annoying fact that a codepoint is encoded as a variable number of code units.
What I meant is issues like:

    auto text = "a\u0302"d;
    writeln(text);                  // "â"
    auto range = VLERange(text);
    // extracts characters correctly?
    auto letter = range.front();    // "a" or "â"?
    // if yes: compares correctly?
    assert(range.front() == "â");   // fail or pass?

Both fail using all unicode-aware types I know of, because
1. They do not recognise that a character is represented by an arbitrary number of codes (code _points_).
2. They do not use normalised forms for comparison, search, count, etc.
(while in unicode a given char can have several representations).
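
A sketch of what correct comparison requires -- assuming a normalisation routine such as std.uni's normalize (any normaliser to NFC would do here):

    import std.uni : normalize, NFC;

    /// Compare by canonical equivalence, not by raw codepoint sequence.
    bool canonicalEquals(string a, string b)
    {
        return normalize!NFC(a) == normalize!NFC(b);
    }

    unittest
    {
        assert(canonicalEquals("â", "a\u0302")); // passes after normalisation
    }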

> The difference is that VLERange being
> a formal concept, algorithms can specialize on it instead of (a)
> specializing for UTF strings or (b) specializing for BidirectionalRange
> and then manually detecting isSomeString inside. Conversely, when
> defining an algorithm you can specify VLERange as a requirement.
> Boyer-Moore is a perfect example - it doesn't work on bidirectional
> ranges, but it does work on VLERange. I suspect there are many like it.
>
> Of course, it would help a lot if we figured out other remarkable VLERanges.

I think I see the point, and the general usefulness of such an abstraction. But it would certainly be more useful in fields other than text manipulation, because text has far more annoying issues (ones that, as in the example above, simply prevent code correctness).

Denis
_________________
vita es estrany
spir.wikidot.com

January 12, 2011
Sorry if I'm jumping in here without the appropriate background, but I don't understand why jumping through these hoops is necessary.  Please let me know if I'm missing anything.

Many problems can be solved by another layer of indirection.  Isn't a string essentially a bidirectional range of code points built on top of a random access range of code units?  It seems to me that each abstraction separately already fits within the existing D range framework and all the difficulties arise as a consequence of trying to lump them into a single abstraction.

Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them?  When character representation is the primary requirement then make it a bidirectional range of code points.  When storage representation and random access is required then make it a random access range of code units.
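
To make that concrete, here is a rough sketch of the code-point layer built over the code-unit layer, using only std.utf's decode, stride, and strideBack (ByCodePoint is a made-up name for illustration, not a design):

    import std.utf : decode, stride, strideBack;

    /// Bidirectional range of code points over random-access code units.
    struct ByCodePoint
    {
        const(char)[] units; // the underlying storage stays random access

        @property bool empty() { return units.length == 0; }

        @property dchar front()
        {
            size_t i = 0;
            return decode(units, i);
        }

        void popFront() { units = units[stride(units, 0) .. $]; }

        @property dchar back()
        {
            auto i = units.length - strideBack(units, units.length);
            return decode(units, i);
        }

        void popBack()
        {
            units = units[0 .. $ - strideBack(units, units.length)];
        }
    }

    // usage: foreach (dchar d; ByCodePoint("déjà vu")) { ... }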
January 12, 2011
On 1/11/11 4:46 PM, spir wrote:
> On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
>>> The main (and massively ignored) issue when manipulating unicode text is
>>> rather that, unlike with legacy character sets, one codepoint does *not*
>>> represent a character in the common sense. In character sets like
>>> latin-1:
>>> * each code represents a character, in the common sense (eg "à")
>>> * each character representation has the same size (1 or 2 bytes)
>>> * each character has a single representation ("à" --> always 0xe0)
>>> All of this is wrong with unicode. And these are complicated and
>>> high-level issues, that appear _after_ decoding, on codepoint sequences.
>>>
>>> If VLERange is helpful in dealing with those problems, then I don't
>>> understand your presentation, sorry. Do you for instance mean such a
>>> range would, under the hood, group together codes belonging to the same
>>> character (thus making indexing meaningful), and/or normalise (decomp &
>>> order) (thus allowing one to compare/find/count correctly)?
>>
>> VLERange would offer automatic decoding in front, back, popFront, and
>> popBack - just like BidirectionalRange does right now. It would also
>> offer access to the representational support by means of indexing - also
>> like char[] et al already do now.
>
> IIUC, for the case of text, VLERange helps abstract away the annoying
> fact that a codepoint is encoded as a variable number of code units.
> What I meant is issues like:
>
> auto text = "a\u0302"d;
> writeln(text); // "â"
> auto range = VLERange(text);
> // extracts characters correctly?
> auto letter = range.front(); // "a" or "â"?
> // if yes: compares correctly?
> assert(range.front() == "â"); // fail or pass?

You should try text.front right now, you might be surprised :o).

Andrei
January 12, 2011
On 01/12/2011 02:22 AM, Andrei Alexandrescu wrote:
>> IIUC, for the case of text, VLERange helps abstract away the annoying
>> fact that a codepoint is encoded as a variable number of code units.
>> What I meant is issues like:
>>
>> auto text = "a\u0302"d;
>> writeln(text); // "â"
>> auto range = VLERange(text);
>> // extracts characters correctly?
>> auto letter = range.front(); // "a" or "â"?
>> // if yes: compares correctly?
>> assert(range.front() == "â"); // fail or pass?
>
> You should try text.front right now, you might be surprised :o).

Hum, right now text.front incorrectly returns "a", as expected. And indeed
	assert ("â" == "a\u0302");
incorrectly fails, as expected.
Both would work with legacy charsets like latin-1. This is a new issue introduced with UCS, one that requires an additional level of abstraction (in addition to the one required by the distinction between codepoint and codeunit!).

You may have a look at https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/Text.html for a rough implementation of a type that does the right thing, and at https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/U%20missing%20level%20of%20abstraction for a (far too long) explanation.
(I have tried to mention these problems a dozen times already, but for some reason nearly everybody seems deaf to them.)


Denis
_________________
vita es estrany
spir.wikidot.com

January 12, 2011
On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw@acres.com.au> said:

> Sorry if I'm jumping in here without the appropriate background, but I don't understand why jumping through these hoops is necessary.  Please let me know if I'm missing anything.
> 
> Many problems can be solved by another layer of indirection.  Isn't a string essentially a bidirectional range of code points built on top of a random access range of code units?

Actually, displaying a UTF-8/UTF-16 string involves a range of glyphs layered over a range of graphemes layered over a range of code points layered over a range of code units. Glyphs represent the visual characters you can get from a font; they often map one-to-one with graphemes, but not always (ligatures, for instance). Graphemes are what people generally reason about when they see text (the so-called "user-perceived characters"); they often map one-to-one with code points, but not always (combining marks, for instance). Code points are a list of standardized codes representing various elements of a string, and code units basically encode the code points.

If you're writing an XML, JSON or whatever else parser, you'll probably care about code points. If you're advancing the insertion point in a text field or counting the number of user-perceived characters, you'll probably want to deal with graphemes. For searching for a substring inside a string, or comparing strings, you'll probably want to deal with either graphemes or collation elements (collation elements are layered on top of code points). To print a string you'll need to map graphemes to the glyphs from a particular font.

Reducing string operations to code points manipulations will only work as long as all your graphemes, collation elements, or glyphs map one-to-one with code points.
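
To make the layering concrete, a small illustration -- a sketch assuming grapheme- and code-point-level adaptors like those in Phobos's std.uni and std.utf (byGrapheme, byDchar):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byDchar;

    void main()
    {
        auto s = "e\u0301le\u0300ve";  // "élève" with combining marks
        assert(s.length == 9);                 // code units (UTF-8)
        assert(s.byDchar.walkLength == 7);     // code points
        assert(s.byGrapheme.walkLength == 5);  // user-perceived characters
    }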


> It seems to me that each abstraction separately already fits within the existing D range framework and all the difficulties arise as a consequence of trying to lump them into a single abstraction.

It's true that each of these abstractions can fit within the existing range framework.


> Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them?  When character representation is the primary requirement then make it a bidirectional range of code points.  When storage representation and random access is required then make it a random access range of code units.

I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 12, 2011
Michel Fortin wrote:
> On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw@acres.com.au> said:
>> Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them?  When character representation is the primary requirement then make it a bidirectional range of code points.  When storage representation and random access is required then make it a random access range of code units.
> 
> I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.

I think the only problem we really have is that "char[]" and "dchar[]" imply that code points are always the appropriate level of abstraction.




January 12, 2011
On 1/12/11 11:28 AM, Don wrote:
> Michel Fortin wrote:
>> On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw@acres.com.au> said:
>>> Why not choose which of these abstractions is most appropriate in a
>>> given situation instead of trying to shoe-horn both concepts into a
>>> single abstraction, and provide for easy conversion between them?
>>> When character representation is the primary requirement then make it
>>> a bidirectional range of code points. When storage representation and
>>> random access is required then make it a random access range of code
>>> units.
>>
>> I think you're right. The need for a new concept isn't that great, and
>> it gets complicated really fast.
>
> I think the only problem we really have is that "char[]" and
> "dchar[]" imply that code points are always the appropriate level of
> abstraction.

I hope to assuage part of that issue with representation(). Again, it's not documented yet (mainly because of the famous ddoc bug that prevents auto functions from carrying documentation). Here it is:

/**
 * Returns the representation type of a string, which is the same type
 * as the string except the character type is replaced by $(D ubyte),
 * $(D ushort), or $(D uint) depending on the character width.
 *
 * Example:
----
string s = "hello";
static assert(is(typeof(representation(s)) == immutable(ubyte)[]));
----
 */
/*private*/ auto representation(Char)(Char[] s) if (isSomeChar!Char)
{
    // Get representation type
    static if (Char.sizeof == 1) enum t = "ubyte";
    else static if (Char.sizeof == 2) enum t = "ushort";
    else static if (Char.sizeof == 4) enum t = "uint";
    else static assert(false); // can't happen due to isSomeChar!Char

    // Get representation qualifier
    static if (is(Char == immutable)) enum q = "immutable";
    else static if (is(Char == const)) enum q = "const";
    else static if (is(Char == shared)) enum q = "shared";
    else enum q = "";

    // Result type is qualifier(RepType)[]
    static if (q.length)
        return mixin("cast(" ~ q ~ "(" ~ t ~ ")[]) s");
    else
        return mixin("cast(" ~ t ~ "[]) s");
}
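
And a quick usage sketch of the intended ASCII fast path (a hypothetical example; countUntil is from std.algorithm, and representation is the function above):

    import std.algorithm : countUntil;

    void main()
    {
        string s = "hello world";
        auto bytes = representation(s);       // immutable(ubyte)[]
        assert(countUntil(bytes, ' ') == 5);  // byte-level search, no decoding
    }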


Andrei