View mode: basic / threaded / horizontal-split · Log in · Help
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Andrei Alexandrescu napisał:

> I've been thinking on how to better deal with Unicode strings. Currently 
> strings are formally bidirectional ranges with a surreptitious random 
> access interface. The random access interface accesses the support of 
> the string, which is understood to hold data in a variable-encoded 
> format. For as long as the programmer understands this relationship, 
> code for string manipulation can be written with relative ease. However, 
> there is still room for writing wrong code that looks legit.
> 
> Sometimes the best way to tackle a hairy reality is to invite it to the 
> negotiation table and offer it promotion to first-class abstraction 
> status. Along that vein I was thinking of defining a new range: 
> VLERange, i.e. Variable Length Encoding Range. Such a range would have 
> the power somewhere in between bidirectional and random access.
> 
> The primitives offered would include empty, access to front and back, 
> popFront and popBack (just like BidirectionalRange), and in addition 
> properties typical of random access ranges: indexing, slicing, and 
> length.

For some compressions implementing *back is troublesome if not impossible...

> Note that the result of the indexing operator is not the same as 
> the element type of the range, as it only represents the unit of encoding.

It's worth to mention it explicitly -- a VLERange is dually typed. It's important for searching. Statically check if original and encoded match, if so, perform fast search on directly on encoded elements. I think an important feature of a VLERange should be dropping  itself down to a encoded-typed range, so that front and back return raw data.

Dual typing will also affect foreach -- in general case you'd want to choose whether to decode or not by typing the element.

I can't stop thinking that VLERange is a two-piece bikini making a bare random-access range safe to look at, and that you can take off when partners have confidence, not a limited random-access probing facility to span the void between front and back.

> In addition to these (and connecting the two), a VLERange would offer 
> two additional primitives:
> 
> 1. size_t stepSize(size_t offset) gives the length of the step needed to 
> skip to the next element.
> 
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
> step that goes to the previous element.
> 
> In both cases, offset is assumed to be at the beginning of a logical 
> element of the range.

So when I move the spinner in an iPod, I get catapulted in position with the raw data opIndex and from there I try to work my way to the next frame to start playback. Sounds promising.

> I suspect that a lot of functions in std.string can be written without 
> Unicode-specific knowledge just by relying on such an interface. 
> Moreover, algorithms can be generalized to other structures that use 
> variable-length encoding, such as those used in data compression. (In 
> that case, the support would be a bit array and the encoded type would 
> be ubyte.)

I agree, acknowledging encoding/compression as a general direction will bring substantial benefits.

> Writing to such ranges is not addressed by this design. Ideas are welcome.

Yeah, we can address outputting later, that's fair.

> Adding VLERange would legitimize strings and would clarify their 
> handling, at the cost of adding one additional concept that needs to be 
> minded. Is the trade-off worthwhile?

Well, the only way to find out is try it. My advice: VLERanges originated as a solution to the string problem, so start with a non-string incarnation. Having at least two (one, we know, is string) plugs that fit the same socket will spur confidence in the abstraction. 

-- 
Tomek
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/11/11 11:13 AM, Michel Fortin wrote:
> On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/11/11 4:41 AM, Michel Fortin wrote:
>>> For instance, say we have a conversion range taking a Unicode string and
>>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>>> "oe" (one chararacter to two characters), in this case 'front' could
>>> simply return "oe" (two characters) in one iteration, with stepSize
>>> being the size of the "œ" code point. In the same conversion process,
>>> encountering "e" followed by a combining "´" would return pre-combined
>>> character "é" (two characters to one character).
>>
>> In the design as I thought of it, the effective length of one logical
>> element is one or more representation units. My understanding is that
>> you are referring to a fractional number of representation units for
>> one logical element.
>
> Your understanding is correct.
>
> I think both cases (one becomes many & many becomes one) are important
> and must be supported. Your proposal only deal with the many-becomes-one
> case.

I disagree. When I suggested this design I was worried of 
over-abstracting. Now this looks like abstracting for stuff that hasn't 
even been addressed concretely yet.

Besides, using bit as an encoding unit sounds like an acceptable 
approach for anything fractional.

> I proposed returning arrays so we can deal with the one-becomes-many
> case ("œ" becoming "oe"). Another idea would be to introduce "substeps".
> When checking for the next character, in addition to determining its
> step length you could also determine the number of substeps in it. "œ"
> would have two substeps, "o" and "e", and when there is no longer any
> substep you move to the next step.
>
> All this said, I think this should stay an implementation detail as this
> would allow a variety of strategies. Also, keeping this an
> implementation detail means that your proposed 'stepSize' and
> 'backstepSize' need to be an implementation detail too (because they
> won't make sense for the one-to-many case). So they can't really be part
> of a standard VLE interface.

If you don't have at least stepSize that tells you how large the stride 
is to get to the next element, it becomes impossible to move within the 
range using integral indexes.

> As far as I know, all we really need to expose to algorithms is whether
> a range has elements of variable length, because this has an impact on
> your indexing capabilities. The rest seems unnecessary to me, or am I
> missing some use cases?

I think you could say that you don't really need stepSize because you 
can compute it as follows:

auto r1 = r;
r1.popFront();
size_t stepSize = r.length - r1.length;

This is tenuous, inefficient, and impossible if the support range 
doesn't support length (I realize that variable-length encodings work 
over other ranges than random access, but then again this may be an 
overgeneralization).


Andrei
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/11/11 11:21 AM, Steven Schveighoffer wrote:
> On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
>>> While this makes it possible to write algorithms that only accept
>>> VLERanges, I don't think it solves the major problem with strings --
>>> they are treated as arrays by the compiler.
>>
>> Except when they're not - foreach with dchar...
>
> This solitary difference is a very thin argument -- foreach(d;
> byDchar(str)) would be just as good without requiring compiler help.
>
>>
>>> I'd also rather see an indexing operation return the element type, and
>>> have a separate function to get the encoding unit. This makes more sense
>>> for generic code IMO.
>>
>> But that's neither here nor there. That would return the logical
>> element at a physical position. I am very doubtful that much generic
>> code could work without knowing they are in fact dealing with a
>> variable-length encoding.
>
> It depends on the function, and the way the indexing is implemented.
>
>>> I noticed you never commented on my proposed string type...
>>>
>>> That reminds me, I should update with suggested changes and re-post it.
>>
>> To be frank, I think it didn't mark a visible improvement. It solved
>> some problems and brought others. There was disagreement over the
>> offered primitives and their semantics.
>
> It is supposed to be simple, and provide the expected interface, without
> causing any undue performance degradation. That is, I should be able to
> do all the things with a replacement string type that I can with a char
> array today, as efficiently as I can today, except I should have to work
> to get at the code-units. The huge benefit is that I can say "I'm
> dealing with this as an array" when I know it's safe

Unfinished sentence? Anyway, for my money you just described what we 
have now.

> The disagreement will never be fully solved, as there is just as much
> disagreement about the current state of affairs ;) e.g. should foreach
> default to using dchar?

I disagree about the disagreement being unsolvable. I'm not rigid; if I 
saw a terrific abstraction in your string, I'd be all for it. It just 
shuffles some issues about, and although I agree it does one thing or 
two better than char[], at the end of the day it doesn't carry its weight.

>> That being said, it's good you are doing this work. In the best case,
>> you could bring a compelling abstraction to the table. In the worst,
>> you'll become as happy about D's strings as I am :o).
>
> I don't think I'll ever be 'happy' with the way strings sit in phobos
> currently. I typically deal in ASCII (i.e. code units), and phobos works
> very hard to prevent that.

I wonder if we could and should extend some of the functions in 
std.string to work with ubyte[]. I did add a function called 
representation() that I didn't document yet. Essentially representation 
gives you the ubyte[], ushort[], or uint[] underneath a string, with the 
same qualifiers. Whenever you want an algorithm to work on ASCII in 
earnest, you can pass representation(s) to it instead of s.

If you work a lot with ASCII, an AsciiString abstraction may be a better 
and more likely to be successful string type. Better yet, you could 
simply focus on AsciiChar and then define ASCII strings as arrays of 
AsciiChar.


Andrei
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
>> The main (and massively ignored) issue when manipulating unicode text is
>> rather that, unlike with legacy character sets, one codepoint does *not*
>> represent a character in the common sense. In character sets like
>> latin-1:
>> * each code represents a character, in the common sense (eg "à")
>> * each character representation has the same size (1 or 2 bytes)
>> * each character has a single representation ("à" --> always 0xe0)
>> All of this is wrong with unicode. And these are complicated and
>> high-level issues, that appear _after_ decoding, on codepoint sequences.
>>
>> If VLERange is helpful is dealing with those problems, then I don't
>> understand your presentation, sorry. Do you for instance mean such a
>> range would, under the hood, group together codes belonging to the same
>> character (thus making indexing meaningful), and/or normalise (decomp &
>> order) (thus allowing to comp/find/count correctly).?
>
> VLERange would offer automatic decoding in front, back, popFront, and
> popBack - just like BidirectionalRange does right now. It would also
> offer access to the representational support by means of indexing - also
> like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstracting from the annoying 
fact that a codepoint is encoded as a variable number of code units.
What I meant is issues like:

    auto text = "a\u0302"d;
    writeln(text);                  // "â"
    auto range = VLERange(text);
    // extracts characters correctly?
    auto letter = range.front();    // "a" or "â"?
    // case yes: compares correctly?
    assert(range.front() == "â");   // fail or pass?

Both fail using all unicode-aware types I know of, because
1. They do not recognise that a character is represented by an arbitrary 
number of codes (code _points_).
2. They do not use normalised forms for comp, search, count, etc...
(while in unicode a given char can have several representations).

> The difference is that VLERange being
> a formal concept, algorithms can specialize on it instead of (a)
> specializing for UTF strings or (b) specializing for BidirectionalRange
> and then manually detecting isSomeString inside. Conversely, when
> defining an algorithm you can specify VLARange as a requirement.
> Boyer-Moore is a perfect example - it doesn't work on bidirectional
> ranges, but it does work on VLARange. I suspect there are many like it.
>
> Of course, it would help a lot if we figured other remarkable VLARanges.

I think I see the point, and the general usefulness of such an 
abstraction. But it would certainly be more useful in other fields than 
text manipulation, because there are far more annoying issues (that, 
like in example above, simply prevent code correctness).

Denis
_________________
vita es estrany
spir.wikidot.com
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Sorry if I'm jumping inhere without the appropriate background, but I 
don't understand why jumping through these hoops are necessary.  Please 
let me know if I'm missing anything.

Many problems can be solved by another layer of indirection.  Isn't a 
string essentially a bidirectional range of code points built on top of 
a random access range of code units?  It seems to me that each 
abstraction separately already fits within the existing D range 
framework and all the difficulties arise as a consequence of trying to 
lump them into a single abstraction.

Why not choose which of these abstractions is most appropriate in a 
given situation instead of trying to shoe-horn both concepts into a 
single abstraction, and provide for easy conversion between them?  When 
character representation is the primary requirement then make it a 
bidirectional range of code points.  When storage representation and 
random access is required then make it a random access range of code units.
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/11/11 4:46 PM, spir wrote:
> On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
>>> The main (and massively ignored) issue when manipulating unicode text is
>>> rather that, unlike with legacy character sets, one codepoint does *not*
>>> represent a character in the common sense. In character sets like
>>> latin-1:
>>> * each code represents a character, in the common sense (eg "à")
>>> * each character representation has the same size (1 or 2 bytes)
>>> * each character has a single representation ("à" --> always 0xe0)
>>> All of this is wrong with unicode. And these are complicated and
>>> high-level issues, that appear _after_ decoding, on codepoint sequences.
>>>
>>> If VLERange is helpful is dealing with those problems, then I don't
>>> understand your presentation, sorry. Do you for instance mean such a
>>> range would, under the hood, group together codes belonging to the same
>>> character (thus making indexing meaningful), and/or normalise (decomp &
>>> order) (thus allowing to comp/find/count correctly).?
>>
>> VLERange would offer automatic decoding in front, back, popFront, and
>> popBack - just like BidirectionalRange does right now. It would also
>> offer access to the representational support by means of indexing - also
>> like char[] et al already do now.
>
> IIUC, for the case of text, VLERange helps abstracting from the annoying
> fact that a codepoint is encoded as a variable number of code units.
> What I meant is issues like:
>
> auto text = "a\u0302"d;
> writeln(text); // "â"
> auto range = VLERange(text);
> // extracts characters correctly?
> auto letter = range.front(); // "a" or "â"?
> // case yes: compares correctly?
> assert(range.front() == "â"); // fail or pass?

You should try text.front right now, you might be surprised :o).

Andrei
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/12/2011 02:22 AM, Andrei Alexandrescu wrote:
>> IIUC, for the case of text, VLERange helps abstracting from the annoying
>> fact that a codepoint is encoded as a variable number of code units.
>> What I meant is issues like:
>>
>> auto text = "a\u0302"d;
>> writeln(text); // "â"
>> auto range = VLERange(text);
>> // extracts characters correctly?
>> auto letter = range.front(); // "a" or "â"?
>> // case yes: compares correctly?
>> assert(range.front() == "â"); // fail or pass?
>
> You should try text.front right now, you might be surprised :o).

Hum, right now incorrectly returns "a" as expected. And indeed
	assert ("â" == "a\u0302");
incorrectly fails as expected.
Both would work with legacy charsets like latin-1. This is a new issue 
introduced with UCS, that requires an additional level of abstraction 
(in addition to the one required by the distincton codepoint/codeunit!)

You may have a look at 
https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/Text.html for 
a rough implementation of a type that does the right thing, & at 
https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/U%20missing%20level%20of%20abstraction 
for a (far too long) explanation.
(I have tried to mention those problems a dozen times already, but for 
any reason nearly everybody seem definitely deaf in front of them.)


Denis
_________________
vita es estrany
spir.wikidot.com
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw@acres.com.au> said:

> Sorry if I'm jumping inhere without the appropriate background, but I 
> don't understand why jumping through these hoops are necessary.  Please 
> let me know if I'm missing anything.
> 
> Many problems can be solved by another layer of indirection.  Isn't a 
> string essentially a bidirectional range of code points built on top of 
> a random access range of code units?

Actually, displaying a UTF-8/UTF-16 string involves a range of of 
glyphs layered over a range of graphemes layered over a range of code 
points layered over a range of code units. Glyphs represent the visual 
characters you can get from a font, they often map one-to-one with 
graphemes but not always (ligatures for instance). Graphemes are what 
people generally reason about when they see text (the so called 
"user-perceived characters"), they often map one-to-one with code 
points but not always (combining marks for instance). Code points are a 
list of standardized codes representing various elements of a string, 
and code units basically encode the code points.

If you're writing an XML, JSON or whatever else parser you'll probably 
care about code points. If you're advancing the insertion point in a 
text field or count the number of user-perceived characters you'll 
probably want to deal with graphemes. For searching a substring inside 
a string, or comparing strings you'll probably want to deal with either 
graphemes or collation elements (collation elements are layered on top 
of code points). To print a string you'll need to map graphemes to the 
glyphs from a particular font.

Reducing string operations to code points manipulations will only work 
as long as all your graphemes, collation elements, or glyphs map 
one-to-one with code points.


> It seems to me that each abstraction separately already fits within the 
> existing D range framework and all the difficulties arise as a 
> consequence of trying to lump them into a single abstraction.

It's true that each of these abstraction can fit within the existing 
range framework.


> Why not choose which of these abstractions is most appropriate in a 
> given situation instead of trying to shoe-horn both concepts into a 
> single abstraction, and provide for easy conversion between them?  When 
> character representation is the primary requirement then make it a 
> bidirectional range of code points.  When storage representation and 
> random access is required then make it a random access range of code 
> units.

I think you're right. The need for a new concept isn't that great, and 
it gets complicated really fast.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin wrote:
> On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw@acres.com.au> said:
>> Why not choose which of these abstractions is most appropriate in a 
>> given situation instead of trying to shoe-horn both concepts into a 
>> single abstraction, and provide for easy conversion between them?  
>> When character representation is the primary requirement then make it 
>> a bidirectional range of code points.  When storage representation and 
>> random access is required then make it a random access range of code 
>> units.
> 
> I think you're right. The need for a new concept isn't that great, and 
> it gets complicated really fast.

I think the only problem that we really have, is that "char[]", 
"dchar[]" implies that code points is always the appropriate level of 
abstraction.
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/12/11 11:28 AM, Don wrote:
> Michel Fortin wrote:
>> On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw@acres.com.au> said:
>>> Why not choose which of these abstractions is most appropriate in a
>>> given situation instead of trying to shoe-horn both concepts into a
>>> single abstraction, and provide for easy conversion between them?
>>> When character representation is the primary requirement then make it
>>> a bidirectional range of code points. When storage representation and
>>> random access is required then make it a random access range of code
>>> units.
>>
>> I think you're right. The need for a new concept isn't that great, and
>> it gets complicated really fast.
>
> I think the only problem that we really have, is that "char[]",
> "dchar[]" implies that code points is always the appropriate level of
> abstraction.

I hope to assuage part of that issue with representation(). Again, it's 
not documented yet (mainly because of the famous ddoc bug that prevents 
auto functions from carrying documentation). Here it is:

/**
 * Returns the representation type of a string, which is the same type
 * as the string except the character type is replaced by $(D ubyte),
 * $(D ushort), or $(D uint) depending on the character width.
 *
 * Example:
----
string s = "hello";
static assert(is(typeof(representation(s)) == immutable(ubyte)[]));
----
 */
/*private*/ auto representation(Char)(Char[] s) if (isSomeChar!Char)
{
    // Get representation type
    static if (Char.sizeof == 1) enum t = "ubyte";
    else static if (Char.sizeof == 2) enum t = "ushort";
    else static if (Char.sizeof == 4) enum t = "uint";
    else static assert(false); // can't happen due to isSomeChar!Char

    // Get representation qualifier
    static if (is(Char == immutable)) enum q = "immutable";
    else static if (is(Char == const)) enum q = "const";
    else static if (is(Char == shared)) enum q = "shared";
    else enum q = "";

    // Result type is qualifier(RepType)[]
    static if (q.length)
        return mixin("cast(" ~ q ~ "(" ~ t ~ ")[]) s");
    else
        return mixin("cast(" ~ t ~ "[]) s");
}


Andrei
1 2 3 4 5 6
Top | Discussion index | About this forum | D home