January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/12/2011 08:28 PM, Don wrote:
> I think the only problem that we really have, is that "char[]",
> "dchar[]" implies that code points is always the appropriate level of
> abstraction.

I'd like to know when it happens that codepoint is the appropriate level 
of abstraction.
* If pieces of text are not manipulated, meaning just used in the 
application, or just transferred via the application as is (from file / 
input / literal to any kind of output), then any kind of encoding just 
works. One can even concatenate, provided all pieces use the same 
encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, compare, search, count, 
replace, not to speak about regex/parsing) requires operating at the 
_higher_ level of characters (in the common sense). Just like with 
historic character sets in which codes used to represent characters (not 
lower-level thingies as in UCS). Else, one reads, compares, changes 
meaningless bits of text.
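Concretely, the failure of codepoint-level manipulation is easy to show. A minimal sketch (in Python for brevity, though the thread is about D; `dchar[]` indexing behaves the same way):

```python
# "café" with the final character stored decomposed: 'e' + U+0301 (combining acute)
s = "cafe\u0301"
print(len(s))    # 5 code points, though a reader perceives 4 characters
print(s[:4])     # code-point slicing strips the accent, yielding "cafe"
```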

As I see it now, we need 2 types:
* One plain string similar to good old ones (bytestring would do the 
job, since most unicode is utf8 encoded) for the first kind of use 
above. With optional validity check when it's supposed to be unicode text.
* One higher-level type abstracting from codepoint (not code unit) 
issues, restoring the necessary properties: (1) each character is one 
element in the sequence (2) each character is always represented the 
same way.


Denis
_________________
vita es estrany
spir.wikidot.com
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
spir wrote:
> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
>
> I'd like to know when it happens that codepoint is the appropriate level
> of abstraction.

When on a document that describes code points... :)

> * If pieces of text are not manipulated, meaning just used in the
> application, or just transferred via the application as is (from file /
> input / literal to any kind of output), then any kind of encoding just
> works. One can even concatenate, provided all pieces use the same
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare,

Compare according to which alphabet's ordering? Surely not Unicode's... 
I may be alone in this, but ordering is tied to an alphabet (or writing 
system), not locale.

I try to solve that issue with my trileri library:

  http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr

Warning: the code is in Turkish and is not aware of the concept of 
collation at all; it has its own simplistic view of text, where every 
character is an entity that can be lower/upper cased to a single character.

> search, count,
> replace, not to speak about regex/parsing) requires operating at the
> _higher_ level of characters (in the common sense).

I don't know this about Unicode: should e and ´ (acute accent) always be 
collated? If so, wouldn't it be impossible to put those two in that 
order, say, in a text book? (Perhaps Unicode defines a way to stop 
collation.)

> Just like with
> historic character sets in which codes used to represent characters (not
> lower-level thingies as in UCS). Else, one reads, compares, changes
> meaningless bits of text.
>
> As I see it now, we need 2 types:

I think we need more than 2 types...

> * One plain string similar to good old ones (bytestring would do the
> job, since most unicode is utf8 encoded) for the first kind of use
> above. With optional validity check when it's supposed to be unicode text.

Agreed. D gives us three UTF encodings, but I am not sure that there is 
only one abstraction above that.

>> * One higher-level type abstracting from codepoint (not code unit)
> issues, restoring the necessary properties: (1) each character is one
> element in the sequence (2) each character is always represented the
> same way.

I think VLERange should solve only the variable-length-encoding issue. 
It should not get into higher abstractions.

Ali
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-12 14:57:58 -0500, spir <denis.spir@gmail.com> said:

> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
> 
> I'd like to know when it happens that codepoint is the appropriate 
> level of abstraction.

I agree with you. I don't see many uses for code points.

One of these uses is writing a parser for a format defined in terms of 
code points (XML for instance). But beyond that, I don't see one.


> * If pieces of text are not manipulated, meaning just used in the 
> application, or just transferred via the application as is (from file / 
> input / literal to any kind of output), then any kind of encoding just 
> works. One can even concatenate, provided all pieces use the same 
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare, search, count, 
> replace, not to speak about regex/parsing) requires operating at the 
> _higher_ level of characters (in the common sense). Just like with 
> historic character sets in which codes used to represent characters 
> (not lower-level thingies as in UCS). Else, one reads, compares, 
> changes meaningless bits of text.

Very true. In the same way that code points can span multiple code 
units, user-perceived characters (graphemes) can span multiple code 
points.

A funny exercise to make a fool of an algorithm working only with code 
points would be to replace the word "fortune" in a text containing the 
word "fortuné". If the last "é" is expressed as two code points, as "e" 
followed by a combining acute accent (this: é), replacing occurrences 
of "fortune" by "expose" would also replace "fortuné" with "exposé" 
because the combining acute accent remains as the code point following 
the word. Quite amusing, but it doesn't really make sense that it works 
like that.

In the case of "é", we're lucky enough to also have a pre-combined 
character to encode it as a single code point, so encountering "é" 
written as two code points is quite rare. But not all combinations of 
marks and characters can be represented as a single code point. The 
correct thing to do is to treat "é" (single code point) and "é" ("e" + 
combining acute accent) as equivalent.
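That substitution trap is easy to reproduce. A quick sketch, in Python here since any codepoint-level replace behaves the same way:

```python
text = "fortune\u0301"                      # "fortuné" as 'fortune' + combining acute accent
result = text.replace("fortune", "expose")  # plain code-point-level replacement
print(result)                               # "exposé": the orphaned accent lands on the new final 'e'
```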

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin@michelf.com> said:

> A funny exercise to make a fool of an algorithm working only with code 
> points would be to replace the word "fortune" in a text containing the 
> word "fortuné". If the last "é" is expressed as two code points, as "e" 
> followed by a combining acute accent (this: é), replacing occurrences 
> of "fortune" by "expose" would also replace "fortuné" with "exposé" 
> because the combining acute accent remains as the code point following 
> the word. Quite amusing, but it doesn't really make sense that it works 
> like that.
> 
> In the case of "é", we're lucky enough to also have a pre-combined 
> character to encode it as a single code point, so encountering "é" 
> written as two code points is quite rare. But not all combinations of 
> marks and characters can be represented as a single code point. The 
> correct thing to do is to treat "é" (single code point) and "é" ("e" + 
> combining acute accent) as equivalent.

Crap, I meant to send this as UTF-8 with combining characters in it, 
but my news client converted everything to ISO-8859-1.

I'm not sure it'll work, but here's my second attempt at posting real 
combining marks:

	Single code point: é
	e with combining mark: é
	t with combining mark: t̂
	t with two combining marks: t̂̃

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 13, 2011
Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 01:45 AM, Michel Fortin wrote:
> On 2011-01-12 14:57:58 -0500, spir <denis.spir@gmail.com> said:
>
>> On 01/12/2011 08:28 PM, Don wrote:
>>> I think the only problem that we really have, is that "char[]",
>>> "dchar[]" implies that code points is always the appropriate level of
>>> abstraction.
>>
>> I'd like to know when it happens that codepoint is the appropriate
>> level of abstraction.
>
> I agree with you. I don't see many uses for code points.
> 
> One of these uses is writing a parser for a format defined in terms of
> code points (XML for instance). But beyond that, I don't see one.

Actually, I once had a real use case for codepoints being the proper 
level of abstraction: a linguistic app in which one operational func 
counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you 
see what I mean.
Once the text is properly NFD-decomposed, each of those marks is coded 
as a codepoint. (But if it's not decomposed, then most of those marks 
are probably hidden by precomposed codes coding characters like "ä".) So 
even such an app benefits from a higher-level type basically 
operating on normalised (NFD) characters.
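As a side note, that mark-counting can be sketched with Python's stdlib unicodedata module (a toy illustration, not the actual app):

```python
import unicodedata

s = "\u00e4"                               # "ä" as a single precomposed code point
nfd = unicodedata.normalize("NFD", s)      # decomposes to 'a' + U+0308 (combining diaeresis)
marks = [c for c in nfd if unicodedata.combining(c)]
print(len(nfd), len(marks))                # 2 code points, 1 of them a combining mark
```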

>> * If pieces of text are not manipulated, meaning just used in the
>> application, or just transferred via the application as is (from file
>> / input / literal to any kind of output), then any kind of encoding
>> just works. One can even concatenate, provided all pieces use the same
>> encoding. --> _lower_ level than codepoint is OK.
>> * But any kind of manipulation (indexing, slicing, compare, search, count,
>> replace, not to speak about regex/parsing) requires operating at the
>> _higher_ level of characters (in the common sense). Just like with
>> historic character sets in which codes used to represent characters
>> (not lower-level thingies as in UCS). Else, one reads, compares,
>> changes meaningless bits of text.
>
> Very true. In the same way that code points can span multiple code
> units, user-perceived characters (graphemes) can span multiple code
> points.
>
> A funny exercise to make a fool of an algorithm working only with code
> points would be to replace the word "fortune" in a text containing the
> word "fortuné". If the last "é" is expressed as two code points, as "e"
> followed by a combining acute accent (this: é), replacing occurrences of
> "fortune" by "expose" would also replace "fortuné" with "exposé" because
> the combining acute accent remains as the code point following the word.
> Quite amusing, but it doesn't really make sense that it works like that.
>
> In the case of "é", we're lucky enough to also have a pre-combined
> character to encode it as a single code point, so encountering "é"
> written as two code points is quite rare. But not all combinations of
> marks and characters can be represented as a single code point. The
> correct thing to do is to treat "é" (single code point) and "é" ("e" +
> combining acute accent) as equivalent.

You'll find another example in the introduction of the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction

About your last remark, this is precisely one of the two abstractions my 
Text type provides: it groups together into "piles" the codes that belong 
to the same "true" character (grapheme) like "é". So the resulting 
text representation is a sequence of "piles", each representing a 
character. Consequence: indexing, slicing, etc. work sensibly (and even 
other operations are faster, for they do not need to perform that 
"piling" again & again).
In addition to that, the string is first NFD-normalised, thus each 
character has one & only one representation. Consequence: search, 
count, replace, etc., and compare (*) work as expected. In your case:
    // 2 forms of "é"
    assert(Text("\u00E9") == Text("\u0065\u0301"));
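For illustration, here is a toy sketch of that "piling" in Python (this `piles` function is hypothetical, not the actual Text implementation at the link above):

```python
import unicodedata

def piles(s):
    """NFD-normalise, then group each base code point with its trailing combining marks."""
    out = []
    for c in unicodedata.normalize("NFD", s):
        if out and unicodedata.combining(c):
            out[-1] += c      # attach the mark to the previous pile
        else:
            out.append(c)     # start a new pile
    return out

# The two spellings of "é" normalise to the same single pile:
assert piles("\u00E9") == piles("\u0065\u0301") == ["e\u0301"]
```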

Denis

(*) According to UCS coding, not language-specific idiosyncrasies.
More generally, Text abstracts from lower-level issues _introduced_ by 
UCS, Unicode's character set. It does not cope with script-, language-, 
culture-, domain-, or app-specific needs such as custom text sorting 
rules. Some base routines for such operations are provided by Text's 
brother lib DUnicode (access to some code properties, safe concat, 
casefolded compare, NF* normalisation).
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On Thursday 13 January 2011 01:49:31 spir wrote:
> On 01/13/2011 01:45 AM, Michel Fortin wrote:
> > On 2011-01-12 14:57:58 -0500, spir <denis.spir@gmail.com> said:
> >> On 01/12/2011 08:28 PM, Don wrote:
> >>> I think the only problem that we really have, is that "char[]",
> >>> "dchar[]" implies that code points is always the appropriate level of
> >>> abstraction.
> >> 
> >> I'd like to know when it happens that codepoint is the appropriate
> >> level of abstraction.
> > 
> > I agree with you. I don't see many uses for code points.
> > 
> > One of these uses is writing a parser for a format defined in terms of
> > code points (XML for instance). But beyond that, I don't see one.
> 
> Actually, I once had a real use case for codepoints being the proper
> level of abstraction: a linguistic app in which one operational func
> counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
> see what I mean.
> Once the text is properly NFD-decomposed, each of those marks is coded
> as a codepoint. (But if it's not decomposed, then most of those marks
> are probably hidden by precomposed codes coding characters like "ä".) So
> even such an app benefits from a higher-level type basically
> operating on normalised (NFD) characters.

There's also the question of efficiency. On the whole, string operations can be 
very expensive - particularly when you're doing a lot of them. The fact that D's 
arrays are so powerful may reduce the problem in D, but in general, if you're 
doing a lot with strings, it can get costly, performance-wise.

The question then is what is the cost of actually having strings abstracted to 
the point that they really are ranges of characters rather than code units or 
code points or whatever? If the cost is large enough, then dealing with strings 
as arrays as they currently are and having the occasional unicode issue could 
very well be worth it. As it is, there are plenty of people who don't want to 
have to care about unicode in the first place, since the programs that they write 
only deal with ASCII characters. The fact that D makes it so easy to deal with 
unicode code points is a definite improvement, but taking the abstraction to the 
point that you're definitely dealing with characters rather than code units or 
code points could be too costly.

Now, if it can be done efficiently, then having unicode dealt with properly 
without the programmer having to worry about it would be a big boon. As it is, 
D's handling of unicode is a big boon, even if it doesn't deal with graphemes 
and the like.

So, I think that we definitely should have an abstraction for unicode which uses 
characters as the elements in the range and doesn't have to care about the 
underlying encoding of the characters (except perhaps picking whether char, 
wchar, or dchar is used internally, and therefore how much space it requires). 
However, I'm not at all convinced that such an abstraction can be done efficiently 
enough to make it the default way of handling strings.

- Jonathan M Davis
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/13/2011 01:51 AM, Michel Fortin wrote:
> On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:
>
>> A funny exercise to make a fool of an algorithm working only with code
>> points would be to replace the word "fortune" in a text containing the
>> word "fortuné". If the last "é" is expressed as two code points, as
>> "e" followed by a combining acute accent (this: é), replacing
>> occurrences of "fortune" by "expose" would also replace "fortuné" with
>> "exposé" because the combining acute accent remains as the code point
>> following the word. Quite amusing, but it doesn't really make sense
>> that it works like that.
>>
>> In the case of "é", we're lucky enough to also have a pre-combined
>> character to encode it as a single code point, so encountering "é"
>> written as two code points is quite rare. But not all combinations of
>> marks and characters can be represented as a single code point. The
>> correct thing to do is to treat "é" (single code point) and "é" ("e" +
>> combining acute accent) as equivalent.
>
> Crap, I meant to send this as UTF-8 with combining characters in it, but
> my news client converted everything to ISO-8859-1.
>
> I'm not sure it'll work, but here's my second attempt at posting real
> combining marks:
>
> Single code point: é
> e with combining mark: é
> t with combining mark: t̂
> t with two combining marks: t̂̃

Works :-) But your first post displayed fine for me as well: for instance 
<<"é" ("e" + combining acute accent)>> was displayed as "é", a single 
accented letter. I guess maybe your email client did not convert to 
iso-8859-1 on sending, but on reading (mine is set to utf-8).

Denis
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
> On Thursday 13 January 2011 01:49:31 spir wrote:
>> On 01/13/2011 01:45 AM, Michel Fortin wrote:
>>> On 2011-01-12 14:57:58 -0500, spir<denis.spir@gmail.com>  said:
>>>> On 01/12/2011 08:28 PM, Don wrote:
>>>>> I think the only problem that we really have, is that "char[]",
>>>>> "dchar[]" implies that code points is always the appropriate level of
>>>>> abstraction.
>>>>
>>>> I'd like to know when it happens that codepoint is the appropriate
>>>> level of abstraction.
>>>
>>> I agree with you. I don't see many uses for code points.
>>>
>>> One of these uses is writing a parser for a format defined in terms of
>>> code points (XML for instance). But beyond that, I don't see one.
>>
>> Actually, I had once a real use case for codepoint beeing the proper
>> level of abstraction: a linguistic app of which one operational func
>> counts occurrences of "scripting marks" like 'a'&  '¨' in "ä". hope you
>> see what I mean.
>> Once the text is properly NFD decomposed, each of those marks in coded
>> as a codepoint. (But if it's not decomposed, then most of those marks
>> are probably hidden by precomposed codes coding characters like "ä".) So
>> that even such an app benefits from a higher-level type basically
>> operating on normalised (NFD) characters.
>
> There's also the question of efficiency. On the whole, string operations can be
> very expensive - particularly when you're doing a lot of them. The fact that D's
> arrays are so powerful may reduce the problem in D, but in general, if you're
> doing a lot with strings, it can get costly, performance-wise.

D's arrays (even dchar[] & dstring) do not allow having correct results 
when dealing with UCS/Unicode text in the general case. See Michel's 
example (and several ones I posted on this list, and the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction 
for a very lengthy explanation).
You and some other people still seem to confuse Unicode's low-level 
issue of codepoint vs code unit with the higher-level issue of codes 
_not_ representing characters in the common sense ("graphemes").

The text pointed above was written precisely to introduce this issue 
because obviously no-one wants to face it... (Eg each time I evoke it on 
this list it is ignored, except by Michel, but the same is true 
everywhere else, including on the Unicode mailing list!). The core of 
the problem is the misleading term "abstract character", which 
deceivingly lets programmers believe that a codepoint codes a 
character, like in historic character sets -- which is *wrong*. No 
Unicode document AFAIK explains this. This is a kind of lie by omission.
Compared to legacy charsets, dealing with Unicode actually requires *2* 
levels of abstraction... (one to decode codepoints from code units, one 
to construct characters from codepoints)
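To make the two levels concrete, a small sketch (Python, toy code only): the first step decodes code units into codepoints, the second groups codepoints into characters:

```python
import unicodedata

raw = b"expose\xcc\x81"        # UTF-8 code units for "exposé" ('e' + combining acute)

# Level 1: code units -> code points
cps = raw.decode("utf-8")
print(len(cps))                # 7 code points

# Level 2: code points -> characters, attaching combining marks to their base
chars = []
for c in cps:
    if chars and unicodedata.combining(c):
        chars[-1] += c
    else:
        chars.append(c)
print(len(chars))              # 6 user-perceived characters
```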

Note that D's stdlib currently provides no means to do this, not even on 
the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode 
library) (good luck ;-). But even ICU, as well as supposed unicode-aware 
typse or librarys for any language, would give you an abstraction 
producing correct results for Michel's example. For instance, Python3 
code fails as miserably as any other. AFAIK, D is the first and only 
language having such a tool (Text.d at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

> The question then is what is the cost of actually having strings abstracted to
> the point that they really are ranges of characters rather than code units or
> code points or whatever? If the cost is large enough, then dealing with strings
> as arrays as they currently are and having the occasional unicode issue could
> very well be worth it. As it is, there are plenty of people who don't want to
> have to care about unicode in the first place, since the programs that they write
> only deal with ASCII characters. The fact that D makes it so easy to deal with
> unicode code points is a definite improvement, but taking the abstraction to the
> point that you're definitely dealing with characters rather than code units or
> code points could be too costly.

When _manipulating_ text (indexing, search, changing), you have the 
choice between:
* On the fly abstraction (composing characters on the fly, and/or 
normalising them), for each operation for each piece of text (including 
parameters, including literals).
* Use of a type that constructs this abstraction once only for each 
piece of text.
Note that a single count operation is forced to construct this 
abstraction on the fly for the whole text... (and for the searched snippet).
Also note that optimisation is probably easier in the second case, for 
the abstraction operation is then standard.

> Now, if it can be done efficiently, then having unicode dealt with properly
> without the programmer having to worry about it would be a big boon. As it is,
> D's handling of unicode is a big boon, even if it doesn't deal with graphemes
> and the like.

It has a cost at initial Text construction time. Currently, on my very 
slow computer, 1MB of source text requires ~ 500 ms (decoding + 
decomposition + ordering + "piling" codes into characters). Decoding 
alone, using D's builtin std.utf.decode, takes about 100 ms.
The bottleneck is piling: 70% of the time on average, on a test case 
mixing texts from a dozen natural languages. We would be very glad to 
get the community's help in optimising this phase :-)
(We have progressed very much already in terms of speed, but now reach 
limits of our competences.)

> So, I think that we definitely should have an abstraction for unicode which uses
> characters as the elements in the range and doesn't have to care about the
> underlying encoding of the characters (except perhaps picking whether char,
> wchar, or dchar is used internally, and therefore how much space it requires).
> However, I'm not at all convinced that such an abstraction can be done efficiently
> enough to make it the default way of handling strings.

If you only have ASCII, or if you don't manipulate text at all, then as 
said in a previous post any string representation works fine (whatever 
the encoding it possibly uses under the hood).
D's builtin char/dchar/wchar and string/dstring/wstring are very nice 
and well done, but they are not necessary in such a use case. Actually, 
as shown by Steven's repeated complaints, they rather get in the way when 
dealing with non-unicode source data (IIUC, by assuming string elements 
are utf codes).

And they do not even try to solve the real issues one necessarily meets 
when manipulating unicode texts, which are due to UCS's coding format. 
Thus my previous statement: the level of codepoints is nearly never the 
proper level of abstraction.

> - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On Thursday 13 January 2011 03:48:46 spir wrote:
> On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
> > On Thursday 13 January 2011 01:49:31 spir wrote:
> >> On 01/13/2011 01:45 AM, Michel Fortin wrote:
> >>> On 2011-01-12 14:57:58 -0500, spir<denis.spir@gmail.com>  said:
> >>>> On 01/12/2011 08:28 PM, Don wrote:
> >>>>> I think the only problem that we really have, is that "char[]",
> >>>>> "dchar[]" implies that code points is always the appropriate level of
> >>>>> abstraction.
> >>>> 
> >>>> I'd like to know when it happens that codepoint is the appropriate
> >>>> level of abstraction.
> >>> 
> >>> I agree with you. I don't see many uses for code points.
> >>> 
> >>> One of these uses is writing a parser for a format defined in terms of
> >>> code points (XML for instance). But beyond that, I don't see one.
> >> 
> >> Actually, I once had a real use case for codepoints being the proper
> >> level of abstraction: a linguistic app in which one operational func
> >> counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
> >> see what I mean.
> >> Once the text is properly NFD-decomposed, each of those marks is coded
> >> as a codepoint. (But if it's not decomposed, then most of those marks
> >> are probably hidden by precomposed codes coding characters like "ä".) So
> >> even such an app benefits from a higher-level type basically
> >> operating on normalised (NFD) characters.
> > 
> > There's also the question of efficiency. On the whole, string operations
> > can be very expensive - particularly when you're doing a lot of them.
> > The fact that D's arrays are so powerful may reduce the problem in D,
> > but in general, if you're doing a lot with strings, it can get costly,
> > performance-wise.
> 
> D's arrays (even dchar[] & dstring) do not allow having correct results
> when dealing with UCS/Unicode text in the general case. See Michel's
> example (and several ones I posted on this list, and the text at
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20le
> vel%20of%20abstraction for a very lengthy explanation).
> You and some other people still seem to confuse Unicode's low-level
> issue of codepoint vs code unit with the higher-level issue of codes
> _not_ representing characters in the common sense ("graphemes").
> 
> The text pointed above was written precisely to introduce this issue
> because obviously no-one wants to face it... (Eg each time I evoke it on
> this list it is ignored, except by Michel, but the same is true
> everywhere else, including on the Unicode mailing list!). The core of
> the problem is the misleading term "abstract character", which
> deceivingly lets programmers believe that a codepoint codes a
> character, like in historic character sets -- which is *wrong*. No
> Unicode document AFAIK explains this. This is a kind of lie by omission.
> Compared to legacy charsets, dealing with Unicode actually requires *2*
> levels of abstraction... (one to decode codepoints from code units, one
> to construct characters from codepoints)
> 
> Note that D's stdlib currently provides no means to do this, not even on
> the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
> library) (good luck ;-). But even ICU, as well as supposed unicode-aware
> typse or librarys for any language, would give you an abstraction
> producing correct results for Michel's example. For instance, Python3
> code fails as miserably as any other. AFAIK, D is the first and only
> language having such a tool (Text.d at
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
> 
> > The question then is what is the cost of actually having strings
> > abstracted to the point that they really are ranges of characters rather
> > than code units or code points or whatever? If the cost is large enough,
> > then dealing with strings as arrays as they currently are and having the
> > occasional unicode issue could very well be worth it. As it is, there
> > are plenty of people who don't want to have to care about unicode in the
> > first place, since the programs that they write only deal with ASCII
> > characters. The fact that D makes it so easy to deal with unicode code
> > points is a definite improvement, but taking the abstraction to the
> > point that you're definitely dealing with characters rather than code
> > units or code points could be too costly.
> 
> When _manipulating_ text (indexing, search, changing), you have the
> choice between:
> * On the fly abstraction (composing characters on the fly, and/or
> normalising them), for each operation for each piece of text (including
> parameters, including literals).
> * Use of a type that constructs this abstraction once only for each
> piece of text.
> Note that a single count operation is forced to construct this
> abstraction on the fly for the whole text... (and for the searched
> snippet). Also note that optimisation is probably easier in the second
> case, for the abstraction operation is then standard.
> 
> > Now, if it can be done efficiently, then having unicode dealt with
> > properly without the programmer having to worry about it would be a big
> > boon. As it is, D's handling of unicode is a big boon, even if it
> > doesn't deal with graphemes and the like.
> 
> It has a cost at initial Text construction time. Currently, on my very
> slow computer, 1MB of source text requires ~ 500 ms (decoding +
> decomposition + ordering + "piling" codes into characters). Decoding
> alone, using D's builtin std.utf.decode, takes about 100 ms.
> The bottleneck is piling: 70% of the time on average, on a test case
> mixing texts from a dozen natural languages. We would be very glad to
> get the community's help in optimising this phase :-)
> (We have progressed very much already in terms of speed, but now reach
> limits of our competences.)
> 
> > So, I think that we definitely should have an abstraction for unicode
> > which uses characters as the elements in the range and doesn't have to
> > care about the underlying encoding of the characters (except perhaps
> > picking whether char, wchar, or dchar is used internally, and therefore
> > how much space it requires). However, I'm not at all convinced that such
> > an abstraction can be done efficiently enough to make it the default way
> > of handling strings.
> 
> If you only have ASCII, or if you don't manipulate text at all, then as
> said in a previous post any string representation works fine (whatever
> the encoding it possibly uses under the hood).
> D's builtin char/dchar/wchar and string/dstring/wstring are very nice
> and well done, but they are not necessary in such a use case. Actually,
> as shown by Steven's repeated complaints, they rather get in the way when
> dealing with non-unicode source data (IIUC, by assuming string elements
> are utf codes).
> 
> And they do not even try to solve the real issues one necessarily meets
> when manipulating unicode texts, which are due to UCS's coding format.
> Thus my previous statement: the level of codepoints is nearly never the
> proper level of abstraction.

I wasn't saying that code points are guaranteed to be characters. I was saying 
that in most cases they are, so if efficiency is an issue, then having properly 
abstract characters could be too costly. However, having a range type which 
properly abstracts characters and deals with whatever graphemes and 
normalization and whatnot that it has to would be a very good thing to have. The 
real question is whether it can be made efficient enough to even consider using it 
normally instead of just when you know that you're really going to need it.

The fact that you're seeing such a large drop in performance with your Text type 
definitely would support the idea that it could be just plain too expensive to 
use such a type in the average case. Even something like a 20% drop in 
performance could be devastating if you're dealing with code which does a lot of 
string processing. Regardless though, there will obviously be cases where you'll 
need something like your Text type if you want to process unicode correctly.

However, regardless of what the best way to handle unicode is in general, I 
think that it's painfully clear that your average programmer doesn't know much 
about unicode. Even understanding the nuances between char, wchar, and dchar is 
more than your average programmer seems to understand at first. The idea that a 
char wouldn't be guaranteed to be an actual character is not something that many 
programmers take to immediately. It's quite foreign to how chars are typically 
dealt with in other languages, and many programmers never worry about unicode at 
all, only dealing with ASCII. So, not only is unicode a rather disgusting 
problem, but it's not one that your average programmer begins to grasp as far as 
I've seen. Unless the issue is abstracted away completely, it takes a fair bit 
of explaining to understand how to deal with unicode properly.

- Jonathan M Davis
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 01:10 PM, Jonathan M Davis wrote:
> I wasn't saying that code points are guaranteed to be characters. I was saying
> that in most cases they are, so if efficiency is an issue, then having properly
> abstract characters could be too costly.

The problem is then: how does a library or application programmer know, 
for sure, that all true characters (graphemes) from all source texts its 
software will ever deal with are coded with a single codepoint?
If you cope with ASCII only now & forever, then you know that.
If you do not manipulate text at all, then the question vanishes.
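[Editor's note: a minimal, admittedly incomplete check for this question could look like the Python sketch below. It only detects combining marks; note that combining marks are not the only way a grapheme can span several code points, so a negative result is no general guarantee.]

```python
import unicodedata

def has_combining_marks(text):
    """True if any code point in `text` is a combining mark, i.e. at
    least one grapheme in it spans more than one code point."""
    return any(unicodedata.combining(cp) for cp in text)

print(has_combining_marks("hello"))       # False: plain ASCII is safe
print(has_combining_marks("cafe\u0301"))  # True: 'é' spelled as e + U+0301
```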

Else, you cannot know, I guess. The problem is partially masked because 
most of us currently process only western-language sources, whose 
scripts have precomposed codes for every _predefined_ character, and 
text-producing software (like editors) usually uses precomposed codes 
when available. Hope I'm clear.
(I hope this use of precomposed codes will change, because the gain in 
space for western languages is negligible while the extra processing 
cost is significant.)
In the future, all of this may change, so that the issue would more 
often be obvious for many programmers dealing with international text. 
Note that even now nothing prevents a user (including a programmer in 
source code!), let alone text-producing software, from using decomposed 
coding (the right choice imo). And there are true characters, and you 
can "invent" as many fancy characters as you like, for which no 
precomposed code is defined. All of this is valid unicode and must be 
properly dealt with.
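[Editor's note: both points, the two codings and the "invented" characters with no precomposed code, can be demonstrated with Python's unicodedata module (illustration only).]

```python
import unicodedata

s = "e\u0301"                          # decomposed 'é': e + combining acute
nfc = unicodedata.normalize("NFC", s)  # folds to precomposed U+00E9
print(nfc == "\u00E9")                 # True

# An "invented" character: 'x' with a combining ring above (U+030A).
# No precomposed code point exists for it, so NFC must leave it as
# two code points -- yet it is perfectly valid Unicode.
fancy = "x\u030A"
print(len(unicodedata.normalize("NFC", fancy)))  # still 2
```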

> However, having a range type which
> properly abstracts characters and deals with whatever graphemes and
> normalization and whatnot that it has to would be a very good thing 
> to have. The real question is whether it can be made efficient enough to
> even consider using it normally instead of just when you know that
> you're really going to need it.

Regarding ranges: we initially planned to expose a range interface in 
our type for iteration, instead of opApply, for better integration with 
the coming D2 style and algorithms. But we had to drop it due to a few 
range bugs exposed in a previous thread (search for "range usability" IIRC).

> The fact that you're seeing such a large drop in performance with your Text type
> definitely would support the idea that it could be just plain too expensive to
> use such a type in the average case. Even something like a 20% drop in
> performance could be devastating if you're dealing with code which does a lot of
> string processing. Regardless though, there will obviously be cases where you'll
> need something like your Text type if you want to process unicode correctly.

The question of efficiency is not as you present it. If you cannot 
guarantee that every character is coded by a single code (in all pieces 
of text, including params and literals), then you *must* construct an 
abstraction at the level of true characters --and even probably 
normalise them.
You have the choice of doing it on the fly for _every_ operation, or 
using a tool like the type Text. In the latter case, not only everything 
is far simpler for client code, but the abstraction is constructed only 
once (and forever ;-).

In the first case, the cost is the same (or rather higher because 
optimisation can probably be more efficient for a single standard case 
than for various operation cases); but _multiplied_ by the number of 
operations you need to perform on each piece of text. Thus, for a given 
operation, you get the slowest possible run: for instance indexing is 
O(k*n) where k is the cost of "piling" a single char, and n the char 
count...

In the second case, the efficiency issue happens only initially for each 
piece of text. Then, every operation is as fast as possible: indexing is 
indeed O(1).
But: this O(1) is slightly slower than with historic charsets because 
characters are now represented by mini code arrays instead of single 
codes. The same point applies even more for every operation involving 
compares (search, count, replace). We cannot solve this: it is due to 
UCS's coding scheme.
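[Editor's note: the trade-off argued above can be modelled in a few lines of Python. This is a toy sketch, not the actual Text.d implementation; the class name merely mirrors the thread's type. The piling cost is paid once at construction, after which indexing is a plain O(1) array access into clusters.]

```python
import unicodedata

class Text:
    """Toy model: pile code points into clusters once at construction,
    then index true characters in O(1)."""
    def __init__(self, s):
        # Construction cost: normalise, then one pass grouping each
        # combining mark with its base character.
        self.clusters = []
        for cp in unicodedata.normalize("NFD", s):
            if unicodedata.combining(cp) and self.clusters:
                self.clusters[-1] += cp
            else:
                self.clusters.append(cp)

    def __getitem__(self, i):  # O(1), unlike re-decoding on the fly
        return self.clusters[i]

    def __len__(self):
        return len(self.clusters)

t = Text("cafe\u0301")    # 'é' decomposed: 5 code points
print(len(t))             # 4 true characters
print(t[3] == "e\u0301")  # True: one element, a mini array of codes
```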

> However, regardless of what the best way to handle unicode is in general, I
> think that it's painfully clear that your average programmer doesn't know much
> about unicode.

True. Even those who think they are informed. Because Unicode's docs 
not only ignore the problem, but contribute to creating it by using the 
misleading term "abstract character" (and often worse, "character" 
alone) to denote what a codepoint codes. All articles I have ever read 
_about_ Unicode by third parties simply follow suit. Raising this issue 
on the unicode mailing list usually results in plain silence.

> Even understanding the nuances between char, wchar, and dchar is
> more than your average programmer seems to understand at first. The idea that a
> char wouldn't be guaranteed to be an actual character is not something that many
> programmers take to immediately. It's quite foreign to how chars are typically
> dealt with in other languages, and many programmers never worry about unicode at
> all, only dealing with ASCII.

(average programmer ? ;-)
Not so much to "how chars are typically dealt with in other 
languages", rather to how characters were coded in historic charsets. 
Other languages ignore the issue, and thus run incorrectly on 
universal text, the same way as D's builtin tools do.
About ASCII, note that the only kind of source it is able to encode is 
plain english text, without any fancy thing in it. A single 
non-breaking space, a "≥" or "×" (multiplication sign, U+00D7), a 
letter borrowed from a foreign language as in "à la", likewise "αβγ", 
not to mention "©" & "®" -- any of these is enough to fall outside 
ASCII.

> So, not only is unicode a rather disgusting
> problem, but it's not one that your average programmer begins to grasp as far as
> I've seen. Unless the issue is abstracted away completely, it takes a fair bit
> of explaining to understand how to deal with unicode properly.

Please have a look at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction, 
and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d
Any feedback welcome (esp on reformulating the text concisely ;-)

> - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com