View mode: basic / threaded / horizontal-split · Log in · Help
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 5:13 PM, spir wrote:
> On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
>> * Line 130: representing a text as a dchar[][] has its advantages but
>> major efficiency issues. To be frank I think it's a disaster. I think a
>> representation building on UTF strings directly is bound to be vastly
>> better.
>
> I don't understand your point. Where is the difference with D's builtin
> types, then?

Unfortunately I won't have much time to discuss all these points, but 
this is a simple one: using dchar[][] wastes memory and time. You need 
to build on a flatter representation. Don't confuse the abstraction you 
are building with its underlying representation. The difference between 
your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend 
you to build on) is that the abstractions offer different, higher-level 
primitives that the representation doesn't.

Let me repeat again: if anyone in this community wants to put work in a 
forward range that iterates one grapheme at a time, that work would be 
very valuable because it will allow us to experiment with graphemes in a 
non-disruptive way while benefiting of a host of algorithms. ByGrapheme 
and friends will help more than defining new string types.


Andrei
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com> said:

> More seriously, you have four choice:
> 
> 1. code unit
> 2. code point
> 3. grapheme
> 4. require the client to state explicitly which kind of 'character' he 
> wants; 'character' being an overloaded word, it's reasonable to ask for 
> disambiguation.

This makes me think of what I did with my XML parser after you made 
code points the element type for strings. Basically, the parser now 
uses 'front' and 'popFront' whenever it needs to get the next code 
point, but most of the time it uses 'frontUnit' and 'popFrontUnit' 
instead (which I had to add) when testing for or skipping an ASCII 
character is sufficient. This way I avoid a lot of unnecessary decoding 
of code points.

For this to work, the same range must let you skip either a unit or a 
code point. If I were using a separate range with a call to toDchar or 
toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't 
have helped much because the new range would essentially become a new 
slice independent of the original, so you can't interleave "I want to 
advance by one unit" with "I want to advance by one code point".

So perhaps the best interface for strings would be to provide multiple 
range-like interfaces that you can use at the level you want.

I'm not sure if this is a good idea, but I thought I should at least 
share my experience.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Thanks to all that has contributed, I am also following this thread with 
great interest. :)

Michel Fortin wrote:
> I mean, a grapheme is a slice of a string, can have multiple code points
> (like a string), can be appended the same way as a string, can be
> composed or decomposed using canonical normalization or compatibility
> normalization (like a string), and should be sorted, uppercased, and
> lowercased according to Unicode rules (like a string). Basically, a
> grapheme is just a string that happens to contain only one grapheme.

I would like to stress the fact that Unicode knows nothing about 
sorting, uppercasing, or lowercasing.

Those operations are tied to the alphabet (or writing system) that a 
certain grapheme happens to belong to at a given time. For example, we 
cannot uppercase the letter i without knowing what alphabet we are 
dealing with. Two possibilities: I and İ (I dot above).

It is the same issue with sorting.

Ali
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Monday 17 January 2011 15:13:42 spir wrote:
> See range bug evoked above. opApply is the only workaround AFAIK.
> Also, ranges cannot yet provide indexed iteration like
> 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when 
using ranges, I would point out that it's trivial to just declare a variable 
which you increment each iteration, so it's easy to get an index even when using 
foreach with ranges. Certainly, I wouldn't consider the lack of index with 
foreach and ranges a good reason to use opApply instead of ranges. There may be 
other reasons which make it worthwhile, but it's so trivial to get an index that 
the loss of range abilities (particularly the ability to use such ranges with 
std.algorithm) dwarfs it in importance.

- Jonathan M Davis
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 11:48 PM, Jonathan M Davis wrote:
> On Monday 17 January 2011 15:13:42 spir wrote:
>> See range bug evoked above. opApply is the only workaround AFAIK.
>> Also, ranges cannot yet provide indexed iteration like
>> 	foreach(i, char ; text) {...}
>
> While it would be nice at times to be able to have an index with foreach when
> using ranges, I would point out that it's trivial to just declare a variable
> which you increment each iteration, so it's easy to get an index even when using
> foreach with ranges. Certainly, I wouldn't consider the lack of index with
> foreach and ranges a good reason to use opApply instead of ranges. There may be
> other reasons which make it worthwhile, but it's so trivial to get an index that
> the loss of range abilities (particularly the ability to use such ranges with
> std.algorithm) dwarfs it in importance.
>
> - Jonathan M Davis

It's a bit more difficult than that. When iterating a variable-length 
encoded range, what you need more than the current item being iterated 
is the physical offset reached inside the range. That's not all that 
difficult either as the range can always provide an extra primitive, but 
a bit annoying (e.g. because it makes iteration with foreach impossible 
if you want the index, unless you return a tuple with each step).

At any rate, I agree with two things - one, we need to fix the foreach 
situation. Two, even before we find a fix, at this point committing to 
iteration with opApply essentially commits the iteratee to an island 
where all basic algorithms need to be reinvented from first principles.


Andrei
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 9:48 PM, Michel Fortin wrote:
> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:
>
>> More seriously, you have four choice:
>>
>> 1. code unit
>> 2. code point
>> 3. grapheme
>> 4. require the client to state explicitly which kind of 'character' he
>> wants; 'character' being an overloaded word, it's reasonable to ask
>> for disambiguation.
>
> This makes me think of what I did with my XML parser after you made code
> points the element type for strings. Basically, the parser now uses
> 'front' and 'popFront' whenever it needs to get the next code point, but
> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
> had to add) when testing for or skipping an ASCII character is
> sufficient. This way I avoid a lot of unnecessary decoding of code points.
>
> For this to work, the same range must let you skip either a unit or a
> code point. If I were using a separate range with a call to toDchar or
> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
> have helped much because the new range would essentially become a new
> slice independent of the original, so you can't interleave "I want to
> advance by one unit" with "I want to advance by one code point".
>
> So perhaps the best interface for strings would be to provide multiple
> range-like interfaces that you can use at the level you want.
>
> I'm not sure if this is a good idea, but I thought I should at least
> share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal!

Andrei
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 18/01/11 16:46, Andrei Alexandrescu wrote:
> On 1/17/11 9:48 PM, Michel Fortin wrote:
>> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
>> said:
>>
>>> More seriously, you have four choice:
>>>
>>> 1. code unit
>>> 2. code point
>>> 3. grapheme
>>> 4. require the client to state explicitly which kind of 'character' he
>>> wants; 'character' being an overloaded word, it's reasonable to ask
>>> for disambiguation.
>>
>> This makes me think of what I did with my XML parser after you made code
>> points the element type for strings. Basically, the parser now uses
>> 'front' and 'popFront' whenever it needs to get the next code point, but
>> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
>> had to add) when testing for or skipping an ASCII character is
>> sufficient. This way I avoid a lot of unnecessary decoding of code
>> points.
>>
>> For this to work, the same range must let you skip either a unit or a
>> code point. If I were using a separate range with a call to toDchar or
>> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
>> have helped much because the new range would essentially become a new
>> slice independent of the original, so you can't interleave "I want to
>> advance by one unit" with "I want to advance by one code point".
>>
>> So perhaps the best interface for strings would be to provide multiple
>> range-like interfaces that you can use at the level you want.
>>
>> I'm not sure if this is a good idea, but I thought I should at least
>> share my experience.
>
> Very insightful. Thanks for sharing. Code it up and make a solid proposal!
>
> Andrei

How does this differ from Steve Schveighoffer's string_t, subtract the 
indexing and slicing of code-points, plus a bidirectional grapheme range?
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/18/2011 04:48 AM, Michel Fortin wrote:
> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:
>
>> More seriously, you have four choice:
>>
>> 1. code unit
>> 2. code point
>> 3. grapheme
>> 4. require the client to state explicitly which kind of 'character' he
>> wants; 'character' being an overloaded word, it's reasonable to ask
>> for disambiguation.
>
> This makes me think of what I did with my XML parser after you made code
> points the element type for strings. Basically, the parser now uses
> 'front' and 'popFront' whenever it needs to get the next code point, but
> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
> had to add) when testing for or skipping an ASCII character is
> sufficient. This way I avoid a lot of unnecessary decoding of code points.
>
> For this to work, the same range must let you skip either a unit or a
> code point. If I were using a separate range with a call to toDchar or
> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
> have helped much because the new range would essentially become a new
> slice independent of the original, so you can't interleave "I want to
> advance by one unit" with "I want to advance by one code point".
>
> So perhaps the best interface for strings would be to provide multiple
> range-like interfaces that you can use at the level you want.
>
> I'm not sure if this is a good idea, but I thought I should at least
> share my experience.

This looks like a very interesting approach. And clear.
I guess range synchronisation would be based on an internal lowest-level 
(codeunit) index. Then, you also need internal validity-checking and/or 
offseting routines when a higher-level range is used after a lowel-level 
one has been used. (I mean eg to ensure start-of-codepoint after a 
codeunit popFront, or throw an error.)
Also, how to avoid duplicating many operational functions (eg find a 
given slice) for each level?

Denis
_________________
vita es estrany
spir.wikidot.com
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/18/2011 06:48 AM, Jonathan M Davis wrote:
> On Monday 17 January 2011 15:13:42 spir wrote:
>> See range bug evoked above. opApply is the only workaround AFAIK.
>> Also, ranges cannot yet provide indexed iteration like
>> 	foreach(i, char ; text) {...}
>
> While it would be nice at times to be able to have an index with foreach when
> using ranges, I would point out that it's trivial to just declare a variable
> which you increment each iteration, so it's easy to get an index even when using
> foreach with ranges. Certainly, I wouldn't consider the lack of index with
> foreach and ranges a good reason to use opApply instead of ranges. There may be
> other reasons which make it worthwhile, but it's so trivial to get an index that
> the loss of range abilities (particularly the ability to use such ranges with
> std.algorithm) dwarfs it in importance.

You are right. I fully agree, in fact. On the other hand, think at 
expectations of users of a library providing iteration on "naturally" 
sequential thingies. The point is that D makes indexed iteration 
available elsewhere.

Denis
_________________
vita es estrany
spir.wikidot.com
January 18, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/18/2011 07:11 AM, Andrei Alexandrescu wrote:
> On 1/17/11 11:48 PM, Jonathan M Davis wrote:
>> On Monday 17 January 2011 15:13:42 spir wrote:
>>> See range bug evoked above. opApply is the only workaround AFAIK.
>>> Also, ranges cannot yet provide indexed iteration like
>>> foreach(i, char ; text) {...}
>>
>> While it would be nice at times to be able to have an index with
>> foreach when
>> using ranges, I would point out that it's trivial to just declare a
>> variable
>> which you increment each iteration, so it's easy to get an index even
>> when using
>> foreach with ranges. Certainly, I wouldn't consider the lack of index
>> with
>> foreach and ranges a good reason to use opApply instead of ranges.
>> There may be
>> other reasons which make it worthwhile, but it's so trivial to get an
>> index that
>> the loss of range abilities (particularly the ability to use such
>> ranges with
>> std.algorithm) dwarfs it in importance.
>>
>> - Jonathan M Davis
>
> It's a bit more difficult than that. When iterating a variable-length
> encoded range, what you need more than the current item being iterated
> is the physical offset reached inside the range. That's not all that
> difficult either as the range can always provide an extra primitive, but
> a bit annoying (e.g. because it makes iteration with foreach impossible
> if you want the index, unless you return a tuple with each step).

This is a very valid point: a range's logical offset is not necessary 
equal to physical (hum) offset, even on a plain sequence.
But for the case of Text it is in fact, precisely because codepoints 
have been grouped in "piles" each representing true character 
(grapheme). This is actually one third of the purpose of Text (the 
others beeing to ensure unique representation of each character, and to 
provide users with clear interface).
Thus, Jonathan's point simply applies to Text.

> At any rate, I agree with two things - one, we need to fix the foreach
> situation. Two, even before we find a fix, at this point committing to
> iteration with opApply essentially commits the iteratee to an island
> where all basic algorithms need to be reinvented from first principles.

I agree. The situation would be different if D had not proposed indexed 
iteration already, and programmers would routinely count manually and/or 
call an extra range primitive, as you say.

Upon using opApply: it works fine nevertheless, at least for a first 
rough implementation like in the case of Text.
Reinventing basic algos is not an issue at this stage, as long as they 
are simple enough, and mainly for testing. (Actually, it can be an 
advantage in avoiding integration issues, possibly due to D's current 
beta stage --I mean bugs that pop up only when combinng given features-- 
like we had eg with range & formatValue).

> Andrei

Denis
_________________
vita es estrany
spir.wikidot.com
13 14 15 16 17 18 19
Top | Discussion index | About this forum | D home