View mode: basic / threaded / horizontal-split · Log in · Help
January 19, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/18/11 7:48 PM, Steven Wawryk wrote:
> On 19/01/11 11:37, Andrei Alexandrescu wrote:
>> On 1/18/11 6:00 PM, Steven Wawryk wrote:
>>> Which is exactly what I asked you about. I understand that you must be
>>> very busy, But how do I get you to look at the actual technical content
>>> of something? Is there something in the way I phrase thing that you
>>> dismiss my introductory motivation without looking into the content?
>>>
>>> I don't mean this as a criticism. I really want to know because I'm
>>> considering a proposal on a different topic but wasn't sure it's worth
>>> it as there seems to be a barrier to getting things considered.
>>
>> One simple fact is that I'm not the only person who needs to look at a
>> design. If you want to propose something for inclusion in Phobos, please
>> put the code in good shape, document it properly, and make a submission
>> in this newsgroup following the Boost model. I get one vote and everyone
>> else gets a vote.
>
> Ok, thanks for this suggestion. But if developing a proposal as concrete
> code is a lot of work that may be rejected, is there a way to sound out
> the idea first before deciding to commit to developing it?

This is the best place as far as I know.

>> Looking back at our exchanges in search for a perceived dismissive
>> attitude on my part (apologies if it seems that way - it was
>> unintentional), I infer your annoyance stems from my answer to this:
>>
>>>>> How does this differ from Steve Schveighoffer's string_t,
>>>>> subtract the indexing and slicing of code-points, plus a
>>>>> bidirectional grapheme range?
>
> No, this was just a summary. Here is the post that you answered
> dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1@digitalmars.com

My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a 
response. If you found that dismissive, I'd be hard pressed to improve 
it. To quote myself:

> I believe the proposed scheme:
>
> 1. Changes the language in a major way;
>
> 2. Is highly disruptive;
>
> 3. Improves the status quo in only minor ways.
>
> I'd be much more willing to improve things by e.g. defining the representation() function I talked about a bit ago, and other less disruptive additions.

That took into consideration your amendments.

>  >
>  > In the interest of moving this on, would it become acceptable to you if:
>  >
>  > 1. indexing and slicing of the code-point range were removed?
>  > 2. any additional ranges are exposed to the user according to decisions
>  > made about graphemes, etc?
>  > 3. other constructive criticisms were accommodated?
>  >
>  > Steve
>  >
>  >
>  > On 15/01/11 03:33, Andrei Alexandrescu wrote:
>  >> On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
>  >>> I respectfully disagree. A stream built on fixed-sized units, but with
>  >>> variable length elements, where you can determine the start of an
>  >>> element in O(1) time given a random index absolutely provides
>  >>> random-access. It just doesn't provide length.
>  >>
>  >> I equally respectfully disagree. I think random access is defined as
>  >> accessing the ith element in O(1) time. That's not the case here.
>  >>
>  >> Andrei
>  >
>
>
>> I happen to have discussed at length my beef with Steve's proposal. Now
>> in one sentence you change the proposed design on the fly without
>> fleshing out the consequences, add to it again without substantiation,
>> and presumably expect me to come with a salient analysis of the result.
>> I don't think it's fair to characterize my answer to that as dismissive,
>> nor to pressure me into expanding on it.
>
> Sorry, I could have given more context. But you didn't discuss what I
> asked, based on the observation that your detailed criticisms of Steve's
> proposal all related to a single aspect of it.

I really don't know what to add to make my answer more meaningful.


Andrei
January 19, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 19/01/11 13:53, Andrei Alexandrescu wrote:
> My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
> response. If you found that dismissive, I'd be hard pressed to improve
> it. To quote myself:
>
>> I believe the proposed scheme:
>>
>> 1. Changes the language in a major way;
>>
>> 2. Is highly disruptive;
>>
>> 3. Improves the status quo in only minor ways.
>>
>> I'd be much more willing to improve things by e.g. defining the
>> representation() function I talked about a bit ago, and other less
>> disruptive additions.
>
> That took into consideration your amendments.

I don't think that it did.  I proposed no language change, nor anything 
disruptive.  The change in status quo I proposed was essentially the 
same one you encouraged here, about a type that gives the user the 
choice of what kind of range to be operated on.  It appears to me that 
you were responding to some perception you had about Steve's full 
proposal (that may have been triggered by something I said in the 
introduction), not what I actually said in the content.

So, I would still be interested to know how to sound out this newsgroup 
with an idea (before coding commitment) and have the suggestions 
considered on something more than a superficial level.

Is the newsgroup too busy?  Should there be people nominated to screen 
ideas that are worth looking at?  Should I use a completely different 
approach?  Your suggestions so far I will take into account, but it 
still looks like there's a barrier to me.


>> Sorry, I could have given more context. But you didn't discuss what I
>> asked, based on the observation that your detailed criticisms of Steve's
>> proposal all related to a single aspect of it.
>
> I really don't know what to add to make my answer more meaningful.
>
>
> Andrei
January 19, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/18/11 9:46 PM, Steven Wawryk wrote:
> On 19/01/11 13:53, Andrei Alexandrescu wrote:
>> My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
>> response. If you found that dismissive, I'd be hard pressed to improve
>> it. To quote myself:
>>
>>> I believe the proposed scheme:
>>>
>>> 1. Changes the language in a major way;
>>>
>>> 2. Is highly disruptive;
>>>
>>> 3. Improves the status quo in only minor ways.
>>>
>>> I'd be much more willing to improve things by e.g. defining the
>>> representation() function I talked about a bit ago, and other less
>>> disruptive additions.
>>
>> That took into consideration your amendments.
>
> I don't think that it did. I proposed no language change, nor anything
> disruptive.

Adding a new string type would be disruptive. Unless I misunderstood, 
there is still a new string type in Steve's proposal, and one that would 
be the default one, even after the amendments you mentioned. That is a 
problem because people write this:

auto s = "hello";

and the question is, what is the type of s.

 The change in status quo I proposed was essentially the same
> one you encouraged here, about a type that gives the user the choice of
> what kind of range to be operated on. It appears to me that you were
> responding to some perception you had about Steve's full proposal (that
> may have been triggered by something I said in the introduction), not
> what I actually said in the content.

If that's what it is, great. To clarify: no new string type, only a 
range that iterates one grapheme over existing strings.

> So, I would still be interested to know how to sound out this newsgroup
> with an idea (before coding commitment) and have the suggestions
> considered on something more than a superficial level.
>
> Is the newsgroup too busy? Should there be people nominated to screen
> ideas that are worth looking at? Should I use a completely different
> approach? Your suggestions so far I will take into account, but it still
> looks like there's a barrier to me.

My perception is that you want to minimize risks before starting to 
invest work into this. I'm not sure how you can do that.

Andrei
January 19, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin wrote:
> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:

> So perhaps the best interface for strings would be to provide multiple
> range-like interfaces that you can use at the level you want.

That's what I've been thinking. The users can choose whether they want 
random access or not. A grapheme-aware string can provide random access 
at a space cost, or no random access for efficient space use.

I see 5 layers in string processing. Layers 1 and 2 are currently 
handled by D, sometimes in an unclear way. e.g. char[] may be used as an 
array of code units or an array of code points depending on the type of 
iteration.

1) Code units: This is what D provides with its string types

This layers models RandomAccessRange

2) Code points: This is what D and Phobos provide for example with 
foreach(d; stride(s, 1))

dchar[] models RandomAccessRange at this layer

char[] and wchar[] model ForwardRange at this layer

(If I understand it correctly, Steven Schveighoffer is trying to provide 
a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

3) Graphemes: This is what the string type that spir is working on. 
There could be at least two types:

3a) RandomAccessGraphemeRange: Has random access but the data type is large

3b) ForwardGraphemeRange: space-efficient but does not provide random access

I think the programmers would be happy to be able to choose.

4) Letters: Uses either 3a or 3b. This is the layer where the idea of a 
writing system enters the picture: lower/upper case transformations and 
sorting happen at this layer. (I have a library that tries to handle 
this layer but is ignorant of graphemes; I am waiting for spir's string 
type. ;))

4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

4b) Models ForwardRange if based on a ForwardGraphemeRange

5) Text: Collection of Letters. This is where a name like "ali & tim" is 
correctly capitalized as "ALİ & TIM" because the text consists of two 
separate writing systems. (The same library that I mentioned in 4 tries 
to handle this layer as well.)

Ali
January 19, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/18/2011 06:11 AM, Ali Çehreli wrote:
> Thanks to all that has contributed, I am also following this thread with
> great interest. :)
>
> Michel Fortin wrote:
>  > I mean, a grapheme is a slice of a string, can have multiple code points
>  > (like a string), can be appended the same way as a string, can be
>  > composed or decomposed using canonical normalization or compatibility
>  > normalization (like a string), and should be sorted, uppercased, and
>  > lowercased according to Unicode rules (like a string). Basically, a
>  > grapheme is just a string that happens to contain only one grapheme.
>
> I would like to stress the fact that Unicode knows nothing about
> sorting, uppercasing, or lowercasing.
>
> Those operations are tied to the alphabet (or writing system) that a
> certain grapheme happens to belong to at a given time. For example, we
> cannot uppercase the letter i without knowing what alphabet we are
> dealing with. Two possibilities: I and İ (I dot above).
>
> It is the same issue with sorting.

This is true and false ;-)

You are right, indeed, on the fact that issues like sorting one are 
language-specific, and more, use-case-specific. The case of the turkish 
beeing a good example. For another one, in french I do not even know 
whether there is an official rule! Anyway, whatever the answer, even eg 
famous newpapers, and official documents, used different rules. Most of 
them let down accents on uppercase, possibly because of computer 
limitation; there is a recent move (back) toward accented uppercase.
This is very annoying: "Hélène" has 2 consistent and used uppercase 
versions. Conversely, how is software supposed to guess the lowercase 
version of "HELENE"?

Upon Unicode, it still defines norms for casing and so-called collation 
(compare, for sorting) algorithms. Dunno much more, i have never applied 
them, personly, for reasons like the ones above. The full list of it's 
technical docs can be found at http://unicode.org/reports/. See in 
particular http://unicode.org/reports/tr10/ for collation. 
(Unfortnately, case mapping is know part of the core standard doc, so 
that it's hard to get it.)

Denis
_________________
vita es estrany
spir.wikidot.com
January 19, 2011
string repr & range levels [was: Re: VLERange: ...]
On 01/19/2011 08:43 AM, Ali Çehreli wrote:
> Michel Fortin wrote:
>  > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
>  > said:
>
>  > So perhaps the best interface for strings would be to provide multiple
>  > range-like interfaces that you can use at the level you want.
>
> That's what I've been thinking. The users can choose whether they want
> random access or not. A grapheme-aware string can provide random access
> at a space cost, or no random access for efficient space use.
>
> I see 5 layers in string processing. Layers 1 and 2 are currently
> handled by D, sometimes in an unclear way. e.g. char[] may be used as an
> array of code units or an array of code points depending on the type of
> iteration.

This is very good and helpful summary. But you do not list all relevant 
aspects of the question, I guess. Defining which codes belong to a given 
grapheme (what I call "piling") is necessary for true O(1) 
random-access, but not only. More importantly, all operations involving 
equality comparison (find, count, replace,...) require normalisation 
--in addition to piling.
A few notes:

> 1) Code units: This is what D provides with its string types
>
> This layers models RandomAccessRange

This level is pure implementation artifact that simply cannot make any 
sense. (from user and thus programmer points of view)
Any kind of text manipulation (slice, find, replace...) may lead to 
random incorrectness, except when source texts can be guaranteed to hold 
plain ASCII (which may be hard to prove).
Conversely, pieces of text only passed around by an app do not require 
any more costly representation, in terms of time (decoding) or space. In 
addition, concat works provided all pieces share the same encoding 
(ASCII beeing a subset of most historic charsets and of UTF-8).

> 2) Code points: This is what D and Phobos provide for example with
> foreach(d; stride(s, 1))
>
> dchar[] models RandomAccessRange at this layer
>
> char[] and wchar[] model ForwardRange at this layer
>
> (If I understand it correctly, Steven Schveighoffer is trying to provide
> a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to 
historic charsets, but actually based on a real fact of natural 
languages: they hold composite characters that can thus be coded by 
combining lower-level codes which represent "scripting marks" (base & 
combining ones).
For this reason, this level can have some sense. My latest guess is that 
apps that consider text as a study object (read linguistic apps), 
instead of a means, may regurarly need operating at this level, in 
addition to the next one.
Normalisation can be applied at this level --and is necessary for the 
above kind of use case. But using it for operations requiring compare 
will typically also require "piling", that is the next level, if only to 
determine what is to be compared.

> 3) Graphemes: This is what the string type that spir is working on.
> There could be at least two types:

This is the meaningful level for, probably, nearly all applications.

> 3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any 
operation benefiting from random-access. But not only: since it 
normalises its input, it should be far faster for any operation using 
compare (rough evaluations suggest a speed ratio of 1 to 2 orders of 
magnitude).
The cost is high in terms of space, which in turn certainly reduces its 
speed gain in the general case, because to cache (miss) effects. (Thank 
you Michel for making this clear.)

> 3b) ForwardGraphemeRange: space-efficient but does not provide random
> access

Is it what Andrei expects, namely a Grapheme type with a corresponding 
ByGrapheme iterator IIUC?
Time efficiency of operations?

3) metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text: 
instead of actually "piling" at construction time, just store metadata 
upon grapheme bounds. The core benefit is indeed to keep "normal" text 
storage (meaning *char[], for modification): would this point please 
Andrei better?
I let you evaluate various consequences of this change (mostly positive, 
I guess). The same metadata principle could certainly be used for 
further optimisations, but this is another story.
I'm motivated to implement this variant, looke like best of both worlds 
tome. (support welcome ;-)

> I think the programmers would be happy to be able to choose.
>
> 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
> writing system enters the picture: lower/upper case transformations and
> sorting happen at this layer. (I have a library that tries to handle
> this layer but is ignorant of graphemes; I am waiting for spir's string
> type. ;))
>
> 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange
>
> 4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are, 
precisely, archetypical true characters, meaning level 3.

[Note: "grapheme", used by Unicode to denote the common sense of 
"character", is simply wrong: "sh" and "ti" are graphemes in english 
(for the same phoneme /ʃ/), not characters; and tab, §, or © are 
probalby not considered graphemes by linguists, while they are 
characters. This is the reason why I try to avoid this term and use 
"character", like ICU's doc, to avoid even more confusion.]

> 5) Text: Collection of Letters. This is where a name like "ali & tim" is
> correctly capitalized as "ALİ & TIM" because the text consists of two
> separate writing systems. (The same library that I mentioned in 4 tries
> to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do 
with text & character representation issues: whatever the character set, 
one has to confront problems like uppercase of 'i', 'ss' vs 'ß', 
definiton of "letter" or "character", matching, sorting order...
Text does not even try to address natural language issues. Instead it 
deals onl,y but hopefully clearly & correctly, with restoring simple and 
safe representation for client apps.

> Ali

Denis
_________________
vita es estrany
spir.wikidot.com
January 19, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Tue, 18 Jan 2011 01:11:04 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 1/17/11 11:48 PM, Jonathan M Davis wrote:
>> On Monday 17 January 2011 15:13:42 spir wrote:
>>> See range bug evoked above. opApply is the only workaround AFAIK.
>>> Also, ranges cannot yet provide indexed iteration like
>>> 	foreach(i, char ; text) {...}
>>
>> While it would be nice at times to be able to have an index with  
>> foreach when
>> using ranges, I would point out that it's trivial to just declare a  
>> variable
>> which you increment each iteration, so it's easy to get an index even  
>> when using
>> foreach with ranges. Certainly, I wouldn't consider the lack of index  
>> with
>> foreach and ranges a good reason to use opApply instead of ranges.  
>> There may be
>> other reasons which make it worthwhile, but it's so trivial to get an  
>> index that
>> the loss of range abilities (particularly the ability to use such  
>> ranges with
>> std.algorithm) dwarfs it in importance.
>>
>> - Jonathan M Davis
>
> It's a bit more difficult than that. When iterating a variable-length  
> encoded range, what you need more than the current item being iterated  
> is the physical offset reached inside the range. That's not all that  
> difficult either as the range can always provide an extra primitive, but  
> a bit annoying (e.g. because it makes iteration with foreach impossible  
> if you want the index, unless you return a tuple with each step).
>
> At any rate, I agree with two things - one, we need to fix the foreach  
> situation. Two, even before we find a fix, at this point committing to  
> iteration with opApply essentially commits the iteratee to an island  
> where all basic algorithms need to be reinvented from first principles.

opApply in no way disables the range interface.  It simply is used for  
foreach.  So the only "algorithm" which is different is foreach.  If you  
use the range primitives, opApply is nowhere to be found.

That being said, we have an annoying situation in all this.  opApply  
cannot be used to foreach using indexes *and* ranges are used to foreach  
elements.  If one opApply is found, the compiler gives up on using the  
range functions for foreach (this is reflected in my most recent string_t  
code).  This means you will have to implement a "wrapper" opApply around  
the range primitives in order to also implement indexing foreach.

-Steve
January 23, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/14/2011 09:34 AM, Steven Schveighoffer wrote:
> Is it common to have multiple modifiers on a single character?  The
> problem I see with using decomposed canonical form for strings is that
> we would have to return a dchar[] for each 'element', which severely
> complicates code that, for instance, only expects to handle English.

Hebrew:
• Almost every letter in a printed Hebrew bible has at least one of—
 ‣ vowel marker (the Hebrew alphabet is otherwise consonantal) and
 ‣ a /dagesh/ dot, indicating the difference between /b/ & /v/, or
   between /mm/ and /m/;
• almost every word has at least one letter with a cantillation mark in
 addition to the above; and
• other marks too complicated & off-topic to explain.

Vietnamese uses Latin letters with accents playing multiple roles, so
there are often two or three accent marks on a single letter; e.g., the
name of the creator of pdfTeX is spelled “Hàn Thế Thành”, with two
accents on the “e”.

I’m sure there are others.

—Joel
Next ›   Last »
15 16 17 18 19
Top | Discussion index | About this forum | D home