January 11, 2011
I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit.

Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access.

The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.

In addition to these (and connecting the two), a VLERange would offer two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.

In both cases, offset is assumed to be at the beginning of a logical element of the range.

I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)

Writing to such ranges is not addressed by this design. Ideas are welcome.

Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?


Andrei
January 11, 2011
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit.
> 
> Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access.
> 
> The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.

Seems like a good idea to define things formally.


> In addition to these (and connecting the two), a VLERange would offer two additional primitives:
> 
> 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.
> 
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one chararacter to two characters), in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return pre-combined character "é" (two characters to one character).


> In both cases, offset is assumed to be at the beginning of a logical element of the range.
> 
> I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)

Applicability to other problems seems like a valuable benefit.


> Writing to such ranges is not addressed by this design. Ideas are welcome.

Writing, as in assigning to 'front'? That's not really possible with variable-length units as it'd need to shift everything in case of a length difference. Or maybe you meant writing as in having an output range for variable-length elements... I'm not sure


> Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?

In my opinion it's not a trade-off at all, it's a formalization of how strings are handled which is better in every regard than a "special case". I welcome this move very much.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 11, 2011
On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit.
>
> Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access.
>
> The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.
>
> In addition to these (and connecting the two), a VLERange would offer two additional primitives:
>
> 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.
>
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.
>
> In both cases, offset is assumed to be at the beginning of a logical element of the range.
>
> I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)
>
> Writing to such ranges is not addressed by this design. Ideas are welcome.
>
> Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?

While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler.

I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit.  This makes more sense for generic code IMO.

I noticed you never commented on my proposed string type...

That reminds me, I should update with suggested changes and re-post it.

-Steve
January 11, 2011
On 1/11/11 4:41 AM, Michel Fortin wrote:
> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>> In addition to these (and connecting the two), a VLERange would offer
>> two additional primitives:
>>
>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>> to skip to the next element.
>>
>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>> step that goes to the previous element.
>
> I like the idea, but I'm not sure about this interface. What's the
> result of stepSize if your range must create two elements from one
> underlying unit? Perhaps in those cases the element type could be an
> array (to return more than one element from one iteration).
>
> For instance, say we have a conversion range taking a Unicode string and
> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
> "oe" (one chararacter to two characters), in this case 'front' could
> simply return "oe" (two characters) in one iteration, with stepSize
> being the size of the "œ" code point. In the same conversion process,
> encountering "e" followed by a combining "´" would return pre-combined
> character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

>> Writing to such ranges is not addressed by this design. Ideas are
>> welcome.
>
> Writing, as in assigning to 'front'? That's not really possible with
> variable-length units as it'd need to shift everything in case of a
> length difference. Or maybe you meant writing as in having an output
> range for variable-length elements... I'm not sure

Well all of the above :o). Clearly assigning to e.g. front or back should not work. The question is what kind of API can we provide beyond simple append with put().


Andrei
January 11, 2011
On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
> While this makes it possible to write algorithms that only accept
> VLERanges, I don't think it solves the major problem with strings --
> they are treated as arrays by the compiler.

Except when they're not - foreach with dchar...

> I'd also rather see an indexing operation return the element type, and
> have a separate function to get the encoding unit. This makes more sense
> for generic code IMO.

But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.

> I noticed you never commented on my proposed string type...
>
> That reminds me, I should update with suggested changes and re-post it.

To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.

That being said, it's good you are doing this work. In the best case, you could bring a compelling abstraction to the table. In the worst, you'll become as happy about D's strings as I am :o).


Andrei

January 11, 2011
On 01/11/2011 02:30 PM, Steven Schveighoffer wrote:
> On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> I've been thinking on how to better deal with Unicode strings.
>> Currently strings are formally bidirectional ranges with a
>> surreptitious random access interface. The random access interface
>> accesses the support of the string, which is understood to hold data
>> in a variable-encoded format. For as long as the programmer
>> understands this relationship, code for string manipulation can be
>> written with relative ease. However, there is still room for writing
>> wrong code that looks legit.
>>
>> Sometimes the best way to tackle a hairy reality is to invite it to
>> the negotiation table and offer it promotion to first-class
>> abstraction status. Along that vein I was thinking of defining a new
>> range: VLERange, i.e. Variable Length Encoding Range. Such a range
>> would have the power somewhere in between bidirectional and random
>> access.
>>
>> The primitives offered would include empty, access to front and back,
>> popFront and popBack (just like BidirectionalRange), and in addition
>> properties typical of random access ranges: indexing, slicing, and
>> length. Note that the result of the indexing operator is not the same
>> as the element type of the range, as it only represents the unit of
>> encoding.
>>
>> In addition to these (and connecting the two), a VLERange would offer
>> two additional primitives:
>>
>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>> to skip to the next element.
>>
>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>> step that goes to the previous element.
>>
>> In both cases, offset is assumed to be at the beginning of a logical
>> element of the range.
>>
>> I suspect that a lot of functions in std.string can be written without
>> Unicode-specific knowledge just by relying on such an interface.
>> Moreover, algorithms can be generalized to other structures that use
>> variable-length encoding, such as those used in data compression. (In
>> that case, the support would be a bit array and the encoded type would
>> be ubyte.)
>>
>> Writing to such ranges is not addressed by this design. Ideas are
>> welcome.
>>
>> Adding VLERange would legitimize strings and would clarify their
>> handling, at the cost of adding one additional concept that needs to
>> be minded. Is the trade-off worthwhile?
>
> While this makes it possible to write algorithms that only accept
> VLERanges, I don't think it solves the major problem with strings --
> they are treated as arrays by the compiler.
>
> I'd also rather see an indexing operation return the element type, and
> have a separate function to get the encoding unit. This makes more sense
> for generic code IMO.
>
> I noticed you never commented on my proposed string type...
>
> That reminds me, I should update with suggested changes and re-post it.

People interested in solving the general problem with Unicode strings may have a look at https://bitbucket.org/denispir/denispir-d. All constructive feedback welcome.
(This will be asked for review in a short while. The main / client interface module is Text.d. A (long) presentation of the issues, reasons, solution can be found in the text called "U missing level of abstraction")

Denis
_________________
vita es estrany
spir.wikidot.com

January 11, 2011
On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
> On 1/11/11 4:41 AM, Michel Fortin wrote:
>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>>> In addition to these (and connecting the two), a VLERange would offer
>>> two additional primitives:
>>>
>>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>>> to skip to the next element.
>>>
>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>>> step that goes to the previous element.
>>
>> I like the idea, but I'm not sure about this interface. What's the
>> result of stepSize if your range must create two elements from one
>> underlying unit? Perhaps in those cases the element type could be an
>> array (to return more than one element from one iteration).
>>
>> For instance, say we have a conversion range taking a Unicode string and
>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>> "oe" (one chararacter to two characters), in this case 'front' could
>> simply return "oe" (two characters) in one iteration, with stepSize
>> being the size of the "œ" code point. In the same conversion process,
>> encountering "e" followed by a combining "´" would return pre-combined
>> character "é" (two characters to one character).
>
> In the design as I thought of it, the effective length of one logical
> element is one or more representation units. My understanding is that
> you are referring to a fractional number of representation units for one
> logical element.

I think Michel is right. If I understand correctly, VLERange addresses the low-level and rather simple issue of each codepoint beeing encoding as a variable number of code units. Right?
If yes, then what is the advantage of VLERange? D already has string/wstring/dstring, allowing to work with the most advatageous encoding according to given source data, and dstring abstracting from low-level encoding issues.

The main (and massively ignored) issue when manipulating unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like latin-1:
* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with unicode. And these are complicated and high-level issues, that appear _after_ decoding, on codepoint sequences.

If VLERange is helpful is dealing with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decomp & order) (thus allowing to comp/find/count correctly).?


denis
_________________
vita es estrany
spir.wikidot.com

January 11, 2011
On 1/11/11 9:09 AM, spir wrote:
> On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
>> On 1/11/11 4:41 AM, Michel Fortin wrote:
>>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> said:
>>>> In addition to these (and connecting the two), a VLERange would offer
>>>> two additional primitives:
>>>>
>>>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>>>> to skip to the next element.
>>>>
>>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>>>> step that goes to the previous element.
>>>
>>> I like the idea, but I'm not sure about this interface. What's the
>>> result of stepSize if your range must create two elements from one
>>> underlying unit? Perhaps in those cases the element type could be an
>>> array (to return more than one element from one iteration).
>>>
>>> For instance, say we have a conversion range taking a Unicode string and
>>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>>> "oe" (one chararacter to two characters), in this case 'front' could
>>> simply return "oe" (two characters) in one iteration, with stepSize
>>> being the size of the "œ" code point. In the same conversion process,
>>> encountering "e" followed by a combining "´" would return pre-combined
>>> character "é" (two characters to one character).
>>
>> In the design as I thought of it, the effective length of one logical
>> element is one or more representation units. My understanding is that
>> you are referring to a fractional number of representation units for one
>> logical element.
>
> I think Michel is right. If I understand correctly, VLERange addresses
> the low-level and rather simple issue of each codepoint beeing encoding
> as a variable number of code units. Right?
> If yes, then what is the advantage of VLERange? D already has
> string/wstring/dstring, allowing to work with the most advatageous
> encoding according to given source data, and dstring abstracting from
> low-level encoding issues.

It' not about the data, it's about algorithms. Currently there are algorithms that ostensibly work for bidirectional ranges, but internally "cheat" by detecting that the input is actually a string, and use that knowledge for better implementations.

The benefit of VLERange would that that it legitimizes those algorithms. I wouldn't be surprised if an entire class of algorithms would in fact require VLERange (e.g. many of those that we commonly consider today "string" algorithms).

> The main (and massively ignored) issue when manipulating unicode text is
> rather that, unlike with legacy character sets, one codepoint does *not*
> represent a character in the common sense. In character sets like latin-1:
> * each code represents a character, in the common sense (eg "à")
> * each character representation has the same size (1 or 2 bytes)
> * each character has a single representation ("à" --> always 0xe0)
> All of this is wrong with unicode. And these are complicated and
> high-level issues, that appear _after_ decoding, on codepoint sequences.
>
> If VLERange is helpful is dealing with those problems, then I don't
> understand your presentation, sorry. Do you for instance mean such a
> range would, under the hood, group together codes belonging to the same
> character (thus making indexing meaningful), and/or normalise (decomp &
> order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now. The difference is that VLERange being a formal concept, algorithms can specialize on it instead of (a) specializing for UTF strings or (b) specializing for BidirectionalRange and then manually detecting isSomeString inside. Conversely, when defining an algorithm you can specify VLARange as a requirement. Boyer-Moore is a perfect example - it doesn't work on bidirectional ranges, but it does work on VLARange. I suspect there are many like it.

Of course, it would help a lot if we figured other remarkable VLARanges. Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special support for those beyond e.g. forward or bidirectional ranges.

* Huffman, RLE, LZ encoded buffers (and many other compressed formats)

* Vocabulary-based translation systems, e.g. associate each word with a number.

* Others...?

Some of these are forward-only (don't allow bidirectional access). Once we have a number of examples, it would be great to figure a number of remarkable algorithms operating on them.


Andrei
January 11, 2011
On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/11/11 4:41 AM, Michel Fortin wrote:
>> For instance, say we have a conversion range taking a Unicode string and
>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>> "oe" (one chararacter to two characters), in this case 'front' could
>> simply return "oe" (two characters) in one iteration, with stepSize
>> being the size of the "œ" code point. In the same conversion process,
>> encountering "e" followed by a combining "´" would return pre-combined
>> character "é" (two characters to one character).
> 
> In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

Your understanding is correct.

I think both cases (one becomes many & many becomes one) are important and must be supported. Your proposal only deal with the many-becomes-one case.

I proposed returning arrays so we can deal with the one-becomes-many case ("œ" becoming "oe"). Another idea would be to introduce "substeps". When checking for the next character, in addition to determining its step length you could also determine the number of substeps in it. "œ" would have two substeps, "o" and "e", and when there is no longer any substep you move to the next step.

All this said, I think this should stay an implementation detail as this would allow a variety of strategies. Also, keeping this an implementation detail means that your proposed 'stepSize' and 'backstepSize' need to be an implementation detail too (because they won't make sense for the one-to-many case). So they can't really be part of a standard VLE interface.

As far as I know, all we really need to expose to algorithms is whether a range has elements of variable length, because this has an impact on your indexing capabilities. The rest seems unnecessary to me, or am I missing some use cases?

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 11, 2011
On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
>> While this makes it possible to write algorithms that only accept
>> VLERanges, I don't think it solves the major problem with strings --
>> they are treated as arrays by the compiler.
>
> Except when they're not - foreach with dchar...

This solitary difference is a very thin argument -- foreach(d; byDchar(str)) would be just as good without requiring compiler help.

>
>> I'd also rather see an indexing operation return the element type, and
>> have a separate function to get the encoding unit. This makes more sense
>> for generic code IMO.
>
> But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.

It depends on the function, and the way the indexing is implemented.

>> I noticed you never commented on my proposed string type...
>>
>> That reminds me, I should update with suggested changes and re-post it.
>
> To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.

It is supposed to be simple, and provide the expected interface, without causing any undue performance degradation.  That is, I should be able to do all the things with a replacement string type that I can with a char array today, as efficiently as I can today, except I should have to work to get at the code-units.  The huge benefit is that I can say "I'm dealing with this as an array" when I know it's safe

The disagreement will never be fully solved, as there is just as much disagreement about the current state of affairs ;)  e.g. should foreach default to using dchar?

> That being said, it's good you are doing this work. In the best case, you could bring a compelling abstraction to the table. In the worst, you'll become as happy about D's strings as I am :o).

I don't think I'll ever be 'happy' with the way strings sit in phobos currently.  I typically deal in ASCII (i.e. code units), and phobos works very hard to prevent that.

-Steve
« First   ‹ Prev
1 2 3 4 5 6 7 8 9 10 11
Top | Discussion index | About this forum | D home