January 18, 2011
On 1/17/11 5:13 PM, spir wrote:
> On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
>> * Line 130: representing a text as a dchar[][] has its advantages but
>> major efficiency issues. To be frank I think it's a disaster. I think a
>> representation building on UTF strings directly is bound to be vastly
>> better.
>
> I don't understand your point. Where is the difference with D's builtin
> types, then?

Unfortunately I won't have much time to discuss all these points, but this is a simple one: using dchar[][] wastes memory and time. You need to build on a flatter representation. Don't confuse the abstraction you are building with its underlying representation. The difference between your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend you to build on) is that the abstractions offer different, higher-level primitives that the representation doesn't.

Let me repeat again: if anyone in this community wants to put work in a forward range that iterates one grapheme at a time, that work would be very valuable because it will allow us to experiment with graphemes in a non-disruptive way while benefiting of a host of algorithms. ByGrapheme and friends will help more than defining new string types.


Andrei
January 18, 2011
On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com> said:

> More seriously, you have four choice:
> 
> 1. code unit
> 2. code point
> 3. grapheme
> 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points.

For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point".

So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want.

I'm not sure if this is a good idea, but I thought I should at least share my experience.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 18, 2011
Thanks to all that has contributed, I am also following this thread with great interest. :)

Michel Fortin wrote:
> I mean, a grapheme is a slice of a string, can have multiple code points
> (like a string), can be appended the same way as a string, can be
> composed or decomposed using canonical normalization or compatibility
> normalization (like a string), and should be sorted, uppercased, and
> lowercased according to Unicode rules (like a string). Basically, a
> grapheme is just a string that happens to contain only one grapheme.

I would like to stress the fact that Unicode knows nothing about sorting, uppercasing, or lowercasing.

Those operations are tied to the alphabet (or writing system) that a certain grapheme happens to belong to at a given time. For example, we cannot uppercase the letter i without knowing what alphabet we are dealing with. Two possibilities: I and İ (I dot above).

It is the same issue with sorting.

Ali
January 18, 2011
On Monday 17 January 2011 15:13:42 spir wrote:
> See range bug evoked above. opApply is the only workaround AFAIK.
> Also, ranges cannot yet provide indexed iteration like
> 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance.

- Jonathan M Davis
January 18, 2011
On 1/17/11 11:48 PM, Jonathan M Davis wrote:
> On Monday 17 January 2011 15:13:42 spir wrote:
>> See range bug evoked above. opApply is the only workaround AFAIK.
>> Also, ranges cannot yet provide indexed iteration like
>> 	foreach(i, char ; text) {...}
>
> While it would be nice at times to be able to have an index with foreach when
> using ranges, I would point out that it's trivial to just declare a variable
> which you increment each iteration, so it's easy to get an index even when using
> foreach with ranges. Certainly, I wouldn't consider the lack of index with
> foreach and ranges a good reason to use opApply instead of ranges. There may be
> other reasons which make it worthwhile, but it's so trivial to get an index that
> the loss of range abilities (particularly the ability to use such ranges with
> std.algorithm) dwarfs it in importance.
>
> - Jonathan M Davis

It's a bit more difficult than that. When iterating a variable-length encoded range, what you need more than the current item being iterated is the physical offset reached inside the range. That's not all that difficult either as the range can always provide an extra primitive, but a bit annoying (e.g. because it makes iteration with foreach impossible if you want the index, unless you return a tuple with each step).

At any rate, I agree with two things - one, we need to fix the foreach situation. Two, even before we find a fix, at this point committing to iteration with opApply essentially commits the iteratee to an island where all basic algorithms need to be reinvented from first principles.


Andrei
January 18, 2011
On 1/17/11 9:48 PM, Michel Fortin wrote:
> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:
>
>> More seriously, you have four choice:
>>
>> 1. code unit
>> 2. code point
>> 3. grapheme
>> 4. require the client to state explicitly which kind of 'character' he
>> wants; 'character' being an overloaded word, it's reasonable to ask
>> for disambiguation.
>
> This makes me think of what I did with my XML parser after you made code
> points the element type for strings. Basically, the parser now uses
> 'front' and 'popFront' whenever it needs to get the next code point, but
> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
> had to add) when testing for or skipping an ASCII character is
> sufficient. This way I avoid a lot of unnecessary decoding of code points.
>
> For this to work, the same range must let you skip either a unit or a
> code point. If I were using a separate range with a call to toDchar or
> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
> have helped much because the new range would essentially become a new
> slice independent of the original, so you can't interleave "I want to
> advance by one unit" with "I want to advance by one code point".
>
> So perhaps the best interface for strings would be to provide multiple
> range-like interfaces that you can use at the level you want.
>
> I'm not sure if this is a good idea, but I thought I should at least
> share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal!

Andrei
January 18, 2011
On 18/01/11 16:46, Andrei Alexandrescu wrote:
> On 1/17/11 9:48 PM, Michel Fortin wrote:
>> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
>> said:
>>
>>> More seriously, you have four choice:
>>>
>>> 1. code unit
>>> 2. code point
>>> 3. grapheme
>>> 4. require the client to state explicitly which kind of 'character' he
>>> wants; 'character' being an overloaded word, it's reasonable to ask
>>> for disambiguation.
>>
>> This makes me think of what I did with my XML parser after you made code
>> points the element type for strings. Basically, the parser now uses
>> 'front' and 'popFront' whenever it needs to get the next code point, but
>> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
>> had to add) when testing for or skipping an ASCII character is
>> sufficient. This way I avoid a lot of unnecessary decoding of code
>> points.
>>
>> For this to work, the same range must let you skip either a unit or a
>> code point. If I were using a separate range with a call to toDchar or
>> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
>> have helped much because the new range would essentially become a new
>> slice independent of the original, so you can't interleave "I want to
>> advance by one unit" with "I want to advance by one code point".
>>
>> So perhaps the best interface for strings would be to provide multiple
>> range-like interfaces that you can use at the level you want.
>>
>> I'm not sure if this is a good idea, but I thought I should at least
>> share my experience.
>
> Very insightful. Thanks for sharing. Code it up and make a solid proposal!
>
> Andrei

How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?

January 18, 2011
On 01/18/2011 04:48 AM, Michel Fortin wrote:
> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:
>
>> More seriously, you have four choice:
>>
>> 1. code unit
>> 2. code point
>> 3. grapheme
>> 4. require the client to state explicitly which kind of 'character' he
>> wants; 'character' being an overloaded word, it's reasonable to ask
>> for disambiguation.
>
> This makes me think of what I did with my XML parser after you made code
> points the element type for strings. Basically, the parser now uses
> 'front' and 'popFront' whenever it needs to get the next code point, but
> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
> had to add) when testing for or skipping an ASCII character is
> sufficient. This way I avoid a lot of unnecessary decoding of code points.
>
> For this to work, the same range must let you skip either a unit or a
> code point. If I were using a separate range with a call to toDchar or
> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
> have helped much because the new range would essentially become a new
> slice independent of the original, so you can't interleave "I want to
> advance by one unit" with "I want to advance by one code point".
>
> So perhaps the best interface for strings would be to provide multiple
> range-like interfaces that you can use at the level you want.
>
> I'm not sure if this is a good idea, but I thought I should at least
> share my experience.

This looks like a very interesting approach. And clear.
I guess range synchronisation would be based on an internal lowest-level (codeunit) index. Then, you also need internal validity-checking and/or offseting routines when a higher-level range is used after a lowel-level one has been used. (I mean eg to ensure start-of-codepoint after a codeunit popFront, or throw an error.)
Also, how to avoid duplicating many operational functions (eg find a given slice) for each level?

Denis
_________________
vita es estrany
spir.wikidot.com

January 18, 2011
On 01/18/2011 06:48 AM, Jonathan M Davis wrote:
> On Monday 17 January 2011 15:13:42 spir wrote:
>> See range bug evoked above. opApply is the only workaround AFAIK.
>> Also, ranges cannot yet provide indexed iteration like
>> 	foreach(i, char ; text) {...}
>
> While it would be nice at times to be able to have an index with foreach when
> using ranges, I would point out that it's trivial to just declare a variable
> which you increment each iteration, so it's easy to get an index even when using
> foreach with ranges. Certainly, I wouldn't consider the lack of index with
> foreach and ranges a good reason to use opApply instead of ranges. There may be
> other reasons which make it worthwhile, but it's so trivial to get an index that
> the loss of range abilities (particularly the ability to use such ranges with
> std.algorithm) dwarfs it in importance.

You are right. I fully agree, in fact. On the other hand, think at expectations of users of a library providing iteration on "naturally" sequential thingies. The point is that D makes indexed iteration available elsewhere.

Denis
_________________
vita es estrany
spir.wikidot.com

January 18, 2011
On 01/18/2011 07:11 AM, Andrei Alexandrescu wrote:
> On 1/17/11 11:48 PM, Jonathan M Davis wrote:
>> On Monday 17 January 2011 15:13:42 spir wrote:
>>> See range bug evoked above. opApply is the only workaround AFAIK.
>>> Also, ranges cannot yet provide indexed iteration like
>>> foreach(i, char ; text) {...}
>>
>> While it would be nice at times to be able to have an index with
>> foreach when
>> using ranges, I would point out that it's trivial to just declare a
>> variable
>> which you increment each iteration, so it's easy to get an index even
>> when using
>> foreach with ranges. Certainly, I wouldn't consider the lack of index
>> with
>> foreach and ranges a good reason to use opApply instead of ranges.
>> There may be
>> other reasons which make it worthwhile, but it's so trivial to get an
>> index that
>> the loss of range abilities (particularly the ability to use such
>> ranges with
>> std.algorithm) dwarfs it in importance.
>>
>> - Jonathan M Davis
>
> It's a bit more difficult than that. When iterating a variable-length
> encoded range, what you need more than the current item being iterated
> is the physical offset reached inside the range. That's not all that
> difficult either as the range can always provide an extra primitive, but
> a bit annoying (e.g. because it makes iteration with foreach impossible
> if you want the index, unless you return a tuple with each step).

This is a very valid point: a range's logical offset is not necessary equal to physical (hum) offset, even on a plain sequence.
But for the case of Text it is in fact, precisely because codepoints have been grouped in "piles" each representing true character (grapheme). This is actually one third of the purpose of Text (the others beeing to ensure unique representation of each character, and to provide users with clear interface).
Thus, Jonathan's point simply applies to Text.

> At any rate, I agree with two things - one, we need to fix the foreach
> situation. Two, even before we find a fix, at this point committing to
> iteration with opApply essentially commits the iteratee to an island
> where all basic algorithms need to be reinvented from first principles.

I agree. The situation would be different if D had not proposed indexed iteration already, and programmers would routinely count manually and/or call an extra range primitive, as you say.

Upon using opApply: it works fine nevertheless, at least for a first rough implementation like in the case of Text.
Reinventing basic algos is not an issue at this stage, as long as they are simple enough, and mainly for testing. (Actually, it can be an advantage in avoiding integration issues, possibly due to D's current beta stage --I mean bugs that pop up only when combinng given features-- like we had eg with range & formatValue).

> Andrei

Denis
_________________
vita es estrany
spir.wikidot.com