View mode: basic / threaded / horizontal-split · Log in · Help
January 11, 2011
VLERange: a range in between BidirectionalRange and RandomAccessRange
I've been thinking on how to better deal with Unicode strings. Currently 
strings are formally bidirectional ranges with a surreptitious random 
access interface. The random access interface accesses the support of 
the string, which is understood to hold data in a variable-encoded 
format. For as long as the programmer understands this relationship, 
code for string manipulation can be written with relative ease. However, 
there is still room for writing wrong code that looks legit.

Sometimes the best way to tackle a hairy reality is to invite it to the 
negotiation table and offer it promotion to first-class abstraction 
status. Along that vein I was thinking of defining a new range: 
VLERange, i.e. Variable Length Encoding Range. Such a range would have 
the power somewhere in between bidirectional and random access.

The primitives offered would include empty, access to front and back, 
popFront and popBack (just like BidirectionalRange), and in addition 
properties typical of random access ranges: indexing, slicing, and 
length. Note that the result of the indexing operator is not the same as 
the element type of the range, as it only represents the unit of encoding.

In addition to these (and connecting the two), a VLERange would offer 
two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed to 
skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
step that goes to the previous element.

In both cases, offset is assumed to be at the beginning of a logical 
element of the range.

I suspect that a lot of functions in std.string can be written without 
Unicode-specific knowledge just by relying on such an interface. 
Moreover, algorithms can be generalized to other structures that use 
variable-length encoding, such as those used in data compression. (In 
that case, the support would be a bit array and the encoded type would 
be ubyte.)

Writing to such ranges is not addressed by this design. Ideas are welcome.

Adding VLERange would legitimize strings and would clarify their 
handling, at the cost of adding one additional concept that needs to be 
minded. Is the trade-off worthwhile?


Andrei
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> I've been thinking on how to better deal with Unicode strings. 
> Currently strings are formally bidirectional ranges with a 
> surreptitious random access interface. The random access interface 
> accesses the support of the string, which is understood to hold data in 
> a variable-encoded format. For as long as the programmer understands 
> this relationship, code for string manipulation can be written with 
> relative ease. However, there is still room for writing wrong code that 
> looks legit.
> 
> Sometimes the best way to tackle a hairy reality is to invite it to the 
> negotiation table and offer it promotion to first-class abstraction 
> status. Along that vein I was thinking of defining a new range: 
> VLERange, i.e. Variable Length Encoding Range. Such a range would have 
> the power somewhere in between bidirectional and random access.
> 
> The primitives offered would include empty, access to front and back, 
> popFront and popBack (just like BidirectionalRange), and in addition 
> properties typical of random access ranges: indexing, slicing, and 
> length. Note that the result of the indexing operator is not the same 
> as the element type of the range, as it only represents the unit of 
> encoding.

Seems like a good idea to define things formally.


> In addition to these (and connecting the two), a VLERange would offer 
> two additional primitives:
> 
> 1. size_t stepSize(size_t offset) gives the length of the step needed 
> to skip to the next element.
> 
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
> step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the 
result of stepSize if your range must create two elements from one 
underlying unit? Perhaps in those cases the element type could be an 
array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string 
and converting it to ISO Latin 1. The best (lossy) conversion for "œ" 
is "oe" (one chararacter to two characters), in this case 'front' could 
simply return "oe" (two characters) in one iteration, with stepSize 
being the size of the "œ" code point. In the same conversion process, 
encountering "e" followed by a combining "´" would return pre-combined 
character "é" (two characters to one character).


> In both cases, offset is assumed to be at the beginning of a logical 
> element of the range.
> 
> I suspect that a lot of functions in std.string can be written without 
> Unicode-specific knowledge just by relying on such an interface. 
> Moreover, algorithms can be generalized to other structures that use 
> variable-length encoding, such as those used in data compression. (In 
> that case, the support would be a bit array and the encoded type would 
> be ubyte.)

Applicability to other problems seems like a valuable benefit.


> Writing to such ranges is not addressed by this design. Ideas are welcome.

Writing, as in assigning to 'front'? That's not really possible with 
variable-length units as it'd need to shift everything in case of a 
length difference. Or maybe you meant writing as in having an output 
range for variable-length elements... I'm not sure


> Adding VLERange would legitimize strings and would clarify their 
> handling, at the cost of adding one additional concept that needs to be 
> minded. Is the trade-off worthwhile?

In my opinion it's not a trade-off at all, it's a formalization of how 
strings are handled which is better in every regard than a "special 
case". I welcome this move very much.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> I've been thinking on how to better deal with Unicode strings. Currently  
> strings are formally bidirectional ranges with a surreptitious random  
> access interface. The random access interface accesses the support of  
> the string, which is understood to hold data in a variable-encoded  
> format. For as long as the programmer understands this relationship,  
> code for string manipulation can be written with relative ease. However,  
> there is still room for writing wrong code that looks legit.
>
> Sometimes the best way to tackle a hairy reality is to invite it to the  
> negotiation table and offer it promotion to first-class abstraction  
> status. Along that vein I was thinking of defining a new range:  
> VLERange, i.e. Variable Length Encoding Range. Such a range would have  
> the power somewhere in between bidirectional and random access.
>
> The primitives offered would include empty, access to front and back,  
> popFront and popBack (just like BidirectionalRange), and in addition  
> properties typical of random access ranges: indexing, slicing, and  
> length. Note that the result of the indexing operator is not the same as  
> the element type of the range, as it only represents the unit of  
> encoding.
>
> In addition to these (and connecting the two), a VLERange would offer  
> two additional primitives:
>
> 1. size_t stepSize(size_t offset) gives the length of the step needed to  
> skip to the next element.
>
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_  
> step that goes to the previous element.
>
> In both cases, offset is assumed to be at the beginning of a logical  
> element of the range.
>
> I suspect that a lot of functions in std.string can be written without  
> Unicode-specific knowledge just by relying on such an interface.  
> Moreover, algorithms can be generalized to other structures that use  
> variable-length encoding, such as those used in data compression. (In  
> that case, the support would be a bit array and the encoded type would  
> be ubyte.)
>
> Writing to such ranges is not addressed by this design. Ideas are  
> welcome.
>
> Adding VLERange would legitimize strings and would clarify their  
> handling, at the cost of adding one additional concept that needs to be  
> minded. Is the trade-off worthwhile?

While this makes it possible to write algorithms that only accept  
VLERanges, I don't think it solves the major problem with strings -- they  
are treated as arrays by the compiler.

I'd also rather see an indexing operation return the element type, and  
have a separate function to get the encoding unit.  This makes more sense  
for generic code IMO.

I noticed you never commented on my proposed string type...

That reminds me, I should update with suggested changes and re-post it.

-Steve
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/11/11 4:41 AM, Michel Fortin wrote:
> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>> In addition to these (and connecting the two), a VLERange would offer
>> two additional primitives:
>>
>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>> to skip to the next element.
>>
>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>> step that goes to the previous element.
>
> I like the idea, but I'm not sure about this interface. What's the
> result of stepSize if your range must create two elements from one
> underlying unit? Perhaps in those cases the element type could be an
> array (to return more than one element from one iteration).
>
> For instance, say we have a conversion range taking a Unicode string and
> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
> "oe" (one chararacter to two characters), in this case 'front' could
> simply return "oe" (two characters) in one iteration, with stepSize
> being the size of the "œ" code point. In the same conversion process,
> encountering "e" followed by a combining "´" would return pre-combined
> character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical 
element is one or more representation units. My understanding is that 
you are referring to a fractional number of representation units for one 
logical element.

>> Writing to such ranges is not addressed by this design. Ideas are
>> welcome.
>
> Writing, as in assigning to 'front'? That's not really possible with
> variable-length units as it'd need to shift everything in case of a
> length difference. Or maybe you meant writing as in having an output
> range for variable-length elements... I'm not sure

Well all of the above :o). Clearly assigning to e.g. front or back 
should not work. The question is what kind of API can we provide beyond 
simple append with put().


Andrei
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
> While this makes it possible to write algorithms that only accept
> VLERanges, I don't think it solves the major problem with strings --
> they are treated as arrays by the compiler.

Except when they're not - foreach with dchar...

> I'd also rather see an indexing operation return the element type, and
> have a separate function to get the encoding unit. This makes more sense
> for generic code IMO.

But that's neither here nor there. That would return the logical element 
at a physical position. I am very doubtful that much generic code could 
work without knowing they are in fact dealing with a variable-length 
encoding.

> I noticed you never commented on my proposed string type...
>
> That reminds me, I should update with suggested changes and re-post it.

To be frank, I think it didn't mark a visible improvement. It solved 
some problems and brought others. There was disagreement over the 
offered primitives and their semantics.

That being said, it's good you are doing this work. In the best case, 
you could bring a compelling abstraction to the table. In the worst, 
you'll become as happy about D's strings as I am :o).


Andrei
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/11/2011 02:30 PM, Steven Schveighoffer wrote:
> On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> I've been thinking on how to better deal with Unicode strings.
>> Currently strings are formally bidirectional ranges with a
>> surreptitious random access interface. The random access interface
>> accesses the support of the string, which is understood to hold data
>> in a variable-encoded format. For as long as the programmer
>> understands this relationship, code for string manipulation can be
>> written with relative ease. However, there is still room for writing
>> wrong code that looks legit.
>>
>> Sometimes the best way to tackle a hairy reality is to invite it to
>> the negotiation table and offer it promotion to first-class
>> abstraction status. Along that vein I was thinking of defining a new
>> range: VLERange, i.e. Variable Length Encoding Range. Such a range
>> would have the power somewhere in between bidirectional and random
>> access.
>>
>> The primitives offered would include empty, access to front and back,
>> popFront and popBack (just like BidirectionalRange), and in addition
>> properties typical of random access ranges: indexing, slicing, and
>> length. Note that the result of the indexing operator is not the same
>> as the element type of the range, as it only represents the unit of
>> encoding.
>>
>> In addition to these (and connecting the two), a VLERange would offer
>> two additional primitives:
>>
>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>> to skip to the next element.
>>
>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>> step that goes to the previous element.
>>
>> In both cases, offset is assumed to be at the beginning of a logical
>> element of the range.
>>
>> I suspect that a lot of functions in std.string can be written without
>> Unicode-specific knowledge just by relying on such an interface.
>> Moreover, algorithms can be generalized to other structures that use
>> variable-length encoding, such as those used in data compression. (In
>> that case, the support would be a bit array and the encoded type would
>> be ubyte.)
>>
>> Writing to such ranges is not addressed by this design. Ideas are
>> welcome.
>>
>> Adding VLERange would legitimize strings and would clarify their
>> handling, at the cost of adding one additional concept that needs to
>> be minded. Is the trade-off worthwhile?
>
> While this makes it possible to write algorithms that only accept
> VLERanges, I don't think it solves the major problem with strings --
> they are treated as arrays by the compiler.
>
> I'd also rather see an indexing operation return the element type, and
> have a separate function to get the encoding unit. This makes more sense
> for generic code IMO.
>
> I noticed you never commented on my proposed string type...
>
> That reminds me, I should update with suggested changes and re-post it.

People interested in solving the general problem with Unicode strings 
may have a look at https://bitbucket.org/denispir/denispir-d. All 
constructive feedback welcome.
(This will be asked for review in a short while. The main / client 
interface module is Text.d. A (long) presentation of the issues, 
reasons, solution can be found in the text called "U missing level of 
abstraction")

Denis
_________________
vita es estrany
spir.wikidot.com
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
> On 1/11/11 4:41 AM, Michel Fortin wrote:
>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>>> In addition to these (and connecting the two), a VLERange would offer
>>> two additional primitives:
>>>
>>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>>> to skip to the next element.
>>>
>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>>> step that goes to the previous element.
>>
>> I like the idea, but I'm not sure about this interface. What's the
>> result of stepSize if your range must create two elements from one
>> underlying unit? Perhaps in those cases the element type could be an
>> array (to return more than one element from one iteration).
>>
>> For instance, say we have a conversion range taking a Unicode string and
>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>> "oe" (one chararacter to two characters), in this case 'front' could
>> simply return "oe" (two characters) in one iteration, with stepSize
>> being the size of the "œ" code point. In the same conversion process,
>> encountering "e" followed by a combining "´" would return pre-combined
>> character "é" (two characters to one character).
>
> In the design as I thought of it, the effective length of one logical
> element is one or more representation units. My understanding is that
> you are referring to a fractional number of representation units for one
> logical element.

I think Michel is right. If I understand correctly, VLERange addresses 
the low-level and rather simple issue of each codepoint beeing encoding 
as a variable number of code units. Right?
If yes, then what is the advantage of VLERange? D already has 
string/wstring/dstring, allowing to work with the most advatageous 
encoding according to given source data, and dstring abstracting from 
low-level encoding issues.

The main (and massively ignored) issue when manipulating unicode text is 
rather that, unlike with legacy character sets, one codepoint does *not* 
represent a character in the common sense. In character sets like latin-1:
* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with unicode. And these are complicated and 
high-level issues, that appear _after_ decoding, on codepoint sequences.

If VLERange is helpful is dealing with those problems, then I don't 
understand your presentation, sorry. Do you for instance mean such a 
range would, under the hood, group together codes belonging to the same 
character (thus making indexing meaningful), and/or normalise (decomp & 
order) (thus allowing to comp/find/count correctly).?


denis
_________________
vita es estrany
spir.wikidot.com
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/11/11 9:09 AM, spir wrote:
> On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
>> On 1/11/11 4:41 AM, Michel Fortin wrote:
>>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> said:
>>>> In addition to these (and connecting the two), a VLERange would offer
>>>> two additional primitives:
>>>>
>>>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>>>> to skip to the next element.
>>>>
>>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>>>> step that goes to the previous element.
>>>
>>> I like the idea, but I'm not sure about this interface. What's the
>>> result of stepSize if your range must create two elements from one
>>> underlying unit? Perhaps in those cases the element type could be an
>>> array (to return more than one element from one iteration).
>>>
>>> For instance, say we have a conversion range taking a Unicode string and
>>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>>> "oe" (one chararacter to two characters), in this case 'front' could
>>> simply return "oe" (two characters) in one iteration, with stepSize
>>> being the size of the "œ" code point. In the same conversion process,
>>> encountering "e" followed by a combining "´" would return pre-combined
>>> character "é" (two characters to one character).
>>
>> In the design as I thought of it, the effective length of one logical
>> element is one or more representation units. My understanding is that
>> you are referring to a fractional number of representation units for one
>> logical element.
>
> I think Michel is right. If I understand correctly, VLERange addresses
> the low-level and rather simple issue of each codepoint beeing encoding
> as a variable number of code units. Right?
> If yes, then what is the advantage of VLERange? D already has
> string/wstring/dstring, allowing to work with the most advatageous
> encoding according to given source data, and dstring abstracting from
> low-level encoding issues.

It' not about the data, it's about algorithms. Currently there are 
algorithms that ostensibly work for bidirectional ranges, but internally 
"cheat" by detecting that the input is actually a string, and use that 
knowledge for better implementations.

The benefit of VLERange would that that it legitimizes those algorithms. 
I wouldn't be surprised if an entire class of algorithms would in fact 
require VLERange (e.g. many of those that we commonly consider today 
"string" algorithms).

> The main (and massively ignored) issue when manipulating unicode text is
> rather that, unlike with legacy character sets, one codepoint does *not*
> represent a character in the common sense. In character sets like latin-1:
> * each code represents a character, in the common sense (eg "à")
> * each character representation has the same size (1 or 2 bytes)
> * each character has a single representation ("à" --> always 0xe0)
> All of this is wrong with unicode. And these are complicated and
> high-level issues, that appear _after_ decoding, on codepoint sequences.
>
> If VLERange is helpful is dealing with those problems, then I don't
> understand your presentation, sorry. Do you for instance mean such a
> range would, under the hood, group together codes belonging to the same
> character (thus making indexing meaningful), and/or normalise (decomp &
> order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and 
popBack - just like BidirectionalRange does right now. It would also 
offer access to the representational support by means of indexing - also 
like char[] et al already do now. The difference is that VLERange being 
a formal concept, algorithms can specialize on it instead of (a) 
specializing for UTF strings or (b) specializing for BidirectionalRange 
and then manually detecting isSomeString inside. Conversely, when 
defining an algorithm you can specify VLARange as a requirement. 
Boyer-Moore is a perfect example - it doesn't work on bidirectional 
ranges, but it does work on VLARange. I suspect there are many like it.

Of course, it would help a lot if we figured other remarkable VLARanges. 
Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special 
support for those beyond e.g. forward or bidirectional ranges.

* Huffman, RLE, LZ encoded buffers (and many other compressed formats)

* Vocabulary-based translation systems, e.g. associate each word with a 
number.

* Others...?

Some of these are forward-only (don't allow bidirectional access). Once 
we have a number of examples, it would be great to figure a number of 
remarkable algorithms operating on them.


Andrei
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/11/11 4:41 AM, Michel Fortin wrote:
>> For instance, say we have a conversion range taking a Unicode string and
>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>> "oe" (one chararacter to two characters), in this case 'front' could
>> simply return "oe" (two characters) in one iteration, with stepSize
>> being the size of the "œ" code point. In the same conversion process,
>> encountering "e" followed by a combining "´" would return pre-combined
>> character "é" (two characters to one character).
> 
> In the design as I thought of it, the effective length of one logical 
> element is one or more representation units. My understanding is that 
> you are referring to a fractional number of representation units for 
> one logical element.

Your understanding is correct.

I think both cases (one becomes many & many becomes one) are important 
and must be supported. Your proposal only deal with the 
many-becomes-one case.

I proposed returning arrays so we can deal with the one-becomes-many 
case ("œ" becoming "oe"). Another idea would be to introduce 
"substeps". When checking for the next character, in addition to 
determining its step length you could also determine the number of 
substeps in it. "œ" would have two substeps, "o" and "e", and when 
there is no longer any substep you move to the next step.

All this said, I think this should stay an implementation detail as 
this would allow a variety of strategies. Also, keeping this an 
implementation detail means that your proposed 'stepSize' and 
'backstepSize' need to be an implementation detail too (because they 
won't make sense for the one-to-many case). So they can't really be 
part of a standard VLE interface.

As far as I know, all we really need to expose to algorithms is whether 
a range has elements of variable length, because this has an impact on 
your indexing capabilities. The rest seems unnecessary to me, or am I 
missing some use cases?

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 11, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
>> While this makes it possible to write algorithms that only accept
>> VLERanges, I don't think it solves the major problem with strings --
>> they are treated as arrays by the compiler.
>
> Except when they're not - foreach with dchar...

This solitary difference is a very thin argument -- foreach(d;  
byDchar(str)) would be just as good without requiring compiler help.

>
>> I'd also rather see an indexing operation return the element type, and
>> have a separate function to get the encoding unit. This makes more sense
>> for generic code IMO.
>
> But that's neither here nor there. That would return the logical element  
> at a physical position. I am very doubtful that much generic code could  
> work without knowing they are in fact dealing with a variable-length  
> encoding.

It depends on the function, and the way the indexing is implemented.

>> I noticed you never commented on my proposed string type...
>>
>> That reminds me, I should update with suggested changes and re-post it.
>
> To be frank, I think it didn't mark a visible improvement. It solved  
> some problems and brought others. There was disagreement over the  
> offered primitives and their semantics.

It is supposed to be simple, and provide the expected interface, without  
causing any undue performance degradation.  That is, I should be able to  
do all the things with a replacement string type that I can with a char  
array today, as efficiently as I can today, except I should have to work  
to get at the code-units.  The huge benefit is that I can say "I'm dealing  
with this as an array" when I know it's safe

The disagreement will never be fully solved, as there is just as much  
disagreement about the current state of affairs ;)  e.g. should foreach  
default to using dchar?

> That being said, it's good you are doing this work. In the best case,  
> you could bring a compelling abstraction to the table. In the worst,  
> you'll become as happy about D's strings as I am :o).

I don't think I'll ever be 'happy' with the way strings sit in phobos  
currently.  I typically deal in ASCII (i.e. code units), and phobos works  
very hard to prevent that.

-Steve
« First   ‹ Prev
1 2 3 4 5
Top | Discussion index | About this forum | D home