January 19, 2011
On 1/18/11 7:48 PM, Steven Wawryk wrote:
> On 19/01/11 11:37, Andrei Alexandrescu wrote:
>> On 1/18/11 6:00 PM, Steven Wawryk wrote:
>>> Which is exactly what I asked you about. I understand that you must be
>>> very busy, But how do I get you to look at the actual technical content
>>> of something? Is there something in the way I phrase thing that you
>>> dismiss my introductory motivation without looking into the content?
>>>
>>> I don't mean this as a criticism. I really want to know because I'm
>>> considering a proposal on a different topic but wasn't sure it's worth
>>> it as there seems to be a barrier to getting things considered.
>>
>> One simple fact is that I'm not the only person who needs to look at a
>> design. If you want to propose something for inclusion in Phobos, please
>> put the code in good shape, document it properly, and make a submission
>> in this newsgroup following the Boost model. I get one vote and everyone
>> else gets a vote.
>
> Ok, thanks for this suggestion. But if developing a proposal as concrete
> code is a lot of work that may be rejected, is there a way to sound out
> the idea first before deciding to commit to developing it?

This is the best place as far as I know.

>> Looking back at our exchanges in search for a perceived dismissive
>> attitude on my part (apologies if it seems that way - it was
>> unintentional), I infer your annoyance stems from my answer to this:
>>
>>>>> How does this differ from Steve Schveighoffer's string_t,
>>>>> subtract the indexing and slicing of code-points, plus a
>>>>> bidirectional grapheme range?
>
> No, this was just a summary. Here is the post that you answered
> dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1@digitalmars.com

My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a response. If you found that dismissive, I'd be hard pressed to improve it. To quote myself:

> I believe the proposed scheme:
>
> 1. Changes the language in a major way;
>
> 2. Is highly disruptive;
>
> 3. Improves the status quo in only minor ways.
>
> I'd be much more willing to improve things by e.g. defining the representation() function I talked about a bit ago, and other less disruptive additions.

That took into consideration your amendments.

>  >
>  > In the interest of moving this on, would it become acceptable to you if:
>  >
>  > 1. indexing and slicing of the code-point range were removed?
>  > 2. any additional ranges are exposed to the user according to decisions
>  > made about graphemes, etc?
>  > 3. other constructive criticisms were accommodated?
>  >
>  > Steve
>  >
>  >
>  > On 15/01/11 03:33, Andrei Alexandrescu wrote:
>  >> On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
>  >>> I respectfully disagree. A stream built on fixed-sized units, but with
>  >>> variable length elements, where you can determine the start of an
>  >>> element in O(1) time given a random index absolutely provides
>  >>> random-access. It just doesn't provide length.
>  >>
>  >> I equally respectfully disagree. I think random access is defined as
>  >> accessing the ith element in O(1) time. That's not the case here.
>  >>
>  >> Andrei
>  >
>
>
>> I happen to have discussed at length my beef with Steve's proposal. Now
>> in one sentence you change the proposed design on the fly without
>> fleshing out the consequences, add to it again without substantiation,
>> and presumably expect me to come with a salient analysis of the result.
>> I don't think it's fair to characterize my answer to that as dismissive,
>> nor to pressure me into expanding on it.
>
> Sorry, I could have given more context. But you didn't discuss what I
> asked, based on the observation that your detailed criticisms of Steve's
> proposal all related to a single aspect of it.

I really don't know what to add to make my answer more meaningful.


Andrei
January 19, 2011
On 19/01/11 13:53, Andrei Alexandrescu wrote:
> My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
> response. If you found that dismissive, I'd be hard pressed to improve
> it. To quote myself:
>
>> I believe the proposed scheme:
>>
>> 1. Changes the language in a major way;
>>
>> 2. Is highly disruptive;
>>
>> 3. Improves the status quo in only minor ways.
>>
>> I'd be much more willing to improve things by e.g. defining the
>> representation() function I talked about a bit ago, and other less
>> disruptive additions.
>
> That took into consideration your amendments.

I don't think that it did.  I proposed no language change, nor anything disruptive.  The change in status quo I proposed was essentially the same one you encouraged here, about a type that gives the user the choice of what kind of range to be operated on.  It appears to me that you were responding to some perception you had about Steve's full proposal (that may have been triggered by something I said in the introduction), not what I actually said in the content.

So, I would still be interested to know how to sound out this newsgroup with an idea (before coding commitment) and have the suggestions considered on something more than a superficial level.

Is the newsgroup too busy?  Should there be people nominated to screen ideas that are worth looking at?  Should I use a completely different approach?  Your suggestions so far I will take into account, but it still looks like there's a barrier to me.


>> Sorry, I could have given more context. But you didn't discuss what I
>> asked, based on the observation that your detailed criticisms of Steve's
>> proposal all related to a single aspect of it.
>
> I really don't know what to add to make my answer more meaningful.
>
>
> Andrei

January 19, 2011
On 1/18/11 9:46 PM, Steven Wawryk wrote:
> On 19/01/11 13:53, Andrei Alexandrescu wrote:
>> My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
>> response. If you found that dismissive, I'd be hard pressed to improve
>> it. To quote myself:
>>
>>> I believe the proposed scheme:
>>>
>>> 1. Changes the language in a major way;
>>>
>>> 2. Is highly disruptive;
>>>
>>> 3. Improves the status quo in only minor ways.
>>>
>>> I'd be much more willing to improve things by e.g. defining the
>>> representation() function I talked about a bit ago, and other less
>>> disruptive additions.
>>
>> That took into consideration your amendments.
>
> I don't think that it did. I proposed no language change, nor anything
> disruptive.

Adding a new string type would be disruptive. Unless I misunderstood, there is still a new string type in Steve's proposal, and one that would be the default one, even after the amendments you mentioned. That is a problem because people write this:

auto s = "hello";

and the question is, what is the type of s.

 The change in status quo I proposed was essentially the same
> one you encouraged here, about a type that gives the user the choice of
> what kind of range to be operated on. It appears to me that you were
> responding to some perception you had about Steve's full proposal (that
> may have been triggered by something I said in the introduction), not
> what I actually said in the content.

If that's what it is, great. To clarify: no new string type, only a range that iterates one grapheme over existing strings.

> So, I would still be interested to know how to sound out this newsgroup
> with an idea (before coding commitment) and have the suggestions
> considered on something more than a superficial level.
>
> Is the newsgroup too busy? Should there be people nominated to screen
> ideas that are worth looking at? Should I use a completely different
> approach? Your suggestions so far I will take into account, but it still
> looks like there's a barrier to me.

My perception is that you want to minimize risks before starting to invest work into this. I'm not sure how you can do that.

Andrei
January 19, 2011
Michel Fortin wrote:
> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:

> So perhaps the best interface for strings would be to provide multiple
> range-like interfaces that you can use at the level you want.

That's what I've been thinking. The users can choose whether they want random access or not. A grapheme-aware string can provide random access at a space cost, or no random access for efficient space use.

I see 5 layers in string processing. Layers 1 and 2 are currently handled by D, sometimes in an unclear way. e.g. char[] may be used as an array of code units or an array of code points depending on the type of iteration.

1) Code units: This is what D provides with its string types

This layers models RandomAccessRange

2) Code points: This is what D and Phobos provide for example with foreach(d; stride(s, 1))

dchar[] models RandomAccessRange at this layer

char[] and wchar[] model ForwardRange at this layer

(If I understand it correctly, Steven Schveighoffer is trying to provide a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

3) Graphemes: This is what the string type that spir is working on. There could be at least two types:

3a) RandomAccessGraphemeRange: Has random access but the data type is large

3b) ForwardGraphemeRange: space-efficient but does not provide random access

I think the programmers would be happy to be able to choose.

4) Letters: Uses either 3a or 3b. This is the layer where the idea of a writing system enters the picture: lower/upper case transformations and sorting happen at this layer. (I have a library that tries to handle this layer but is ignorant of graphemes; I am waiting for spir's string type. ;))

4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

4b) Models ForwardRange if based on a ForwardGraphemeRange

5) Text: Collection of Letters. This is where a name like "ali & tim" is correctly capitalized as "ALİ & TIM" because the text consists of two separate writing systems. (The same library that I mentioned in 4 tries to handle this layer as well.)

Ali
January 19, 2011
On 01/18/2011 06:11 AM, Ali Çehreli wrote:
> Thanks to all that has contributed, I am also following this thread with
> great interest. :)
>
> Michel Fortin wrote:
>  > I mean, a grapheme is a slice of a string, can have multiple code points
>  > (like a string), can be appended the same way as a string, can be
>  > composed or decomposed using canonical normalization or compatibility
>  > normalization (like a string), and should be sorted, uppercased, and
>  > lowercased according to Unicode rules (like a string). Basically, a
>  > grapheme is just a string that happens to contain only one grapheme.
>
> I would like to stress the fact that Unicode knows nothing about
> sorting, uppercasing, or lowercasing.
>
> Those operations are tied to the alphabet (or writing system) that a
> certain grapheme happens to belong to at a given time. For example, we
> cannot uppercase the letter i without knowing what alphabet we are
> dealing with. Two possibilities: I and İ (I dot above).
>
> It is the same issue with sorting.

This is true and false ;-)

You are right, indeed, on the fact that issues like sorting one are language-specific, and more, use-case-specific. The case of the turkish beeing a good example. For another one, in french I do not even know whether there is an official rule! Anyway, whatever the answer, even eg famous newpapers, and official documents, used different rules. Most of them let down accents on uppercase, possibly because of computer limitation; there is a recent move (back) toward accented uppercase.
This is very annoying: "Hélène" has 2 consistent and used uppercase versions. Conversely, how is software supposed to guess the lowercase version of "HELENE"?

Upon Unicode, it still defines norms for casing and so-called collation (compare, for sorting) algorithms. Dunno much more, i have never applied them, personly, for reasons like the ones above. The full list of it's technical docs can be found at http://unicode.org/reports/. See in particular http://unicode.org/reports/tr10/ for collation. (Unfortnately, case mapping is know part of the core standard doc, so that it's hard to get it.)

Denis
_________________
vita es estrany
spir.wikidot.com

January 19, 2011
On 01/19/2011 08:43 AM, Ali Çehreli wrote:
> Michel Fortin wrote:
>  > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin@michelf.com>
>  > said:
>
>  > So perhaps the best interface for strings would be to provide multiple
>  > range-like interfaces that you can use at the level you want.
>
> That's what I've been thinking. The users can choose whether they want
> random access or not. A grapheme-aware string can provide random access
> at a space cost, or no random access for efficient space use.
>
> I see 5 layers in string processing. Layers 1 and 2 are currently
> handled by D, sometimes in an unclear way. e.g. char[] may be used as an
> array of code units or an array of code points depending on the type of
> iteration.

This is very good and helpful summary. But you do not list all relevant aspects of the question, I guess. Defining which codes belong to a given grapheme (what I call "piling") is necessary for true O(1) random-access, but not only. More importantly, all operations involving equality comparison (find, count, replace,...) require normalisation --in addition to piling.
A few notes:

> 1) Code units: This is what D provides with its string types
>
> This layers models RandomAccessRange

This level is pure implementation artifact that simply cannot make any sense. (from user and thus programmer points of view)
Any kind of text manipulation (slice, find, replace...) may lead to random incorrectness, except when source texts can be guaranteed to hold plain ASCII (which may be hard to prove).
Conversely, pieces of text only passed around by an app do not require any more costly representation, in terms of time (decoding) or space. In addition, concat works provided all pieces share the same encoding (ASCII beeing a subset of most historic charsets and of UTF-8).

> 2) Code points: This is what D and Phobos provide for example with
> foreach(d; stride(s, 1))
>
> dchar[] models RandomAccessRange at this layer
>
> char[] and wchar[] model ForwardRange at this layer
>
> (If I understand it correctly, Steven Schveighoffer is trying to provide
> a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to historic charsets, but actually based on a real fact of natural languages: they hold composite characters that can thus be coded by combining lower-level codes which represent "scripting marks" (base & combining ones).
For this reason, this level can have some sense. My latest guess is that apps that consider text as a study object (read linguistic apps), instead of a means, may regurarly need operating at this level, in addition to the next one.
Normalisation can be applied at this level --and is necessary for the above kind of use case. But using it for operations requiring compare will typically also require "piling", that is the next level, if only to determine what is to be compared.

> 3) Graphemes: This is what the string type that spir is working on.
> There could be at least two types:

This is the meaningful level for, probably, nearly all applications.

> 3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any operation benefiting from random-access. But not only: since it normalises its input, it should be far faster for any operation using compare (rough evaluations suggest a speed ratio of 1 to 2 orders of magnitude).
The cost is high in terms of space, which in turn certainly reduces its speed gain in the general case, because to cache (miss) effects. (Thank you Michel for making this clear.)

> 3b) ForwardGraphemeRange: space-efficient but does not provide random
> access

Is it what Andrei expects, namely a Grapheme type with a corresponding ByGrapheme iterator IIUC?
Time efficiency of operations?

3) metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text: instead of actually "piling" at construction time, just store metadata upon grapheme bounds. The core benefit is indeed to keep "normal" text storage (meaning *char[], for modification): would this point please Andrei better?
I let you evaluate various consequences of this change (mostly positive, I guess). The same metadata principle could certainly be used for further optimisations, but this is another story.
I'm motivated to implement this variant, looke like best of both worlds tome. (support welcome ;-)

> I think the programmers would be happy to be able to choose.
>
> 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
> writing system enters the picture: lower/upper case transformations and
> sorting happen at this layer. (I have a library that tries to handle
> this layer but is ignorant of graphemes; I am waiting for spir's string
> type. ;))
>
> 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange
>
> 4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are, precisely, archetypical true characters, meaning level 3.

[Note: "grapheme", used by Unicode to denote the common sense of "character", is simply wrong: "sh" and "ti" are graphemes in english (for the same phoneme /ʃ/), not characters; and tab, §, or © are probalby not considered graphemes by linguists, while they are characters. This is the reason why I try to avoid this term and use "character", like ICU's doc, to avoid even more confusion.]

> 5) Text: Collection of Letters. This is where a name like "ali & tim" is
> correctly capitalized as "ALİ & TIM" because the text consists of two
> separate writing systems. (The same library that I mentioned in 4 tries
> to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do with text & character representation issues: whatever the character set, one has to confront problems like uppercase of 'i', 'ss' vs 'ß', definiton of "letter" or "character", matching, sorting order...
Text does not even try to address natural language issues. Instead it deals onl,y but hopefully clearly & correctly, with restoring simple and safe representation for client apps.

> Ali

Denis
_________________
vita es estrany
spir.wikidot.com

January 19, 2011
On Tue, 18 Jan 2011 01:11:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/17/11 11:48 PM, Jonathan M Davis wrote:
>> On Monday 17 January 2011 15:13:42 spir wrote:
>>> See range bug evoked above. opApply is the only workaround AFAIK.
>>> Also, ranges cannot yet provide indexed iteration like
>>> 	foreach(i, char ; text) {...}
>>
>> While it would be nice at times to be able to have an index with foreach when
>> using ranges, I would point out that it's trivial to just declare a variable
>> which you increment each iteration, so it's easy to get an index even when using
>> foreach with ranges. Certainly, I wouldn't consider the lack of index with
>> foreach and ranges a good reason to use opApply instead of ranges. There may be
>> other reasons which make it worthwhile, but it's so trivial to get an index that
>> the loss of range abilities (particularly the ability to use such ranges with
>> std.algorithm) dwarfs it in importance.
>>
>> - Jonathan M Davis
>
> It's a bit more difficult than that. When iterating a variable-length encoded range, what you need more than the current item being iterated is the physical offset reached inside the range. That's not all that difficult either as the range can always provide an extra primitive, but a bit annoying (e.g. because it makes iteration with foreach impossible if you want the index, unless you return a tuple with each step).
>
> At any rate, I agree with two things - one, we need to fix the foreach situation. Two, even before we find a fix, at this point committing to iteration with opApply essentially commits the iteratee to an island where all basic algorithms need to be reinvented from first principles.

opApply in no way disables the range interface.  It simply is used for foreach.  So the only "algorithm" which is different is foreach.  If you use the range primitives, opApply is nowhere to be found.

That being said, we have an annoying situation in all this.  opApply cannot be used to foreach using indexes *and* ranges are used to foreach elements.  If one opApply is found, the compiler gives up on using the range functions for foreach (this is reflected in my most recent string_t code).  This means you will have to implement a "wrapper" opApply around the range primitives in order to also implement indexing foreach.

-Steve
January 23, 2011
On 01/14/2011 09:34 AM, Steven Schveighoffer wrote:
> Is it common to have multiple modifiers on a single character?  The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.

Hebrew:
• Almost every letter in a printed Hebrew bible has at least one of—
  ‣ vowel marker (the Hebrew alphabet is otherwise consonantal) and
  ‣ a /dagesh/ dot, indicating the difference between /b/ & /v/, or
    between /mm/ and /m/;
• almost every word has at least one letter with a cantillation mark in
  addition to the above; and
• other marks too complicated & off-topic to explain.

Vietnamese uses Latin letters with accents playing multiple roles, so there are often two or three accent marks on a single letter; e.g., the name of the creator of pdfTeX is spelled “Hàn Thế Thành”, with two accents on the “e”.

I’m sure there are others.

—Joel
9 10 11 12 13 14 15 16 17 18 19
Next ›   Last »