January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/12/2011 08:28 PM, Don wrote:
> I think the only problem that we really have, is that "char[]",
> "dchar[]" implies that code points is always the appropriate level of
> abstraction.

I'd like to know when it happens that codepoint is the appropriate level 
of abstraction.
* If pieces of text are not manipulated, meaning just used in the 
application, or just transferred via the application as is (from file / 
input / literal to any kind of output), then any kind of encoding just 
works. One can even concatenate, provided all pieces use the same 
encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, compare, search, count, 
replace, not to speak about regex/parsing) requires operating at the 
_higher_ level of characters (in the common sense). Just like with 
historic character sets in which codes used to represent characters (not 
lower-level thingies as in UCS). Else, one reads, compares, changes 
meaningless bits of text.
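Concretely, the failure of codepoint-level manipulation is easy to show. A minimal sketch (in Python for brevity, though the thread is about D; `dchar[]` indexing behaves the same way):

```python
# "café" with the final character stored decomposed: 'e' + U+0301 (combining acute)
s = "cafe\u0301"
print(len(s))    # 5 code points, though a reader perceives 4 characters
print(s[:4])     # code-point slicing strips the accent, yielding "cafe"
```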

As I see it now, we need 2 types:
* One plain string similar to good old ones (bytestring would do the 
job, since most unicode is utf8 encoded) for the first kind of use 
above. With optional validity check when it's supposed to be unicode text.
* One higher-level type abstracting from codepoint (not code unit) 
issues, restoring the necessary properties: (1) each character is one 
element in the sequence (2) each character is always represented the 
same way.


Denis
_________________
vita es estrany
spir.wikidot.com
January 12, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
spir wrote:
> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
>
> I'd like to know when it happens that codepoint is the appropriate level
> of abstraction.

When on a document that describes code points... :)

> * If pieces of text are not manipulated, meaning just used in the
> application, or just transferred via the application as is (from file /
> input / literal to any kind of output), then any kind of encoding just
> works. One can even concatenate, provided all pieces use the same
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare,

Compare according to which alphabet's ordering? Surely not Unicode's... 
I may be alone in this, but ordering is tied to an alphabet (or writing 
system), not locale.

I try to solve that issue with my trileri library:

  http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr

Warning: the code is in Turkish and is not aware of the concept of 
collation at all; it has its own simplistic view of text, where every 
character is an entity that can be lower/upper cased to a single character.

> search, count,
> replace, not to speak about regex/parsing) requires operating at the
> _higher_ level of characters (in the common sense).

I don't know this about Unicode: should e and ´ (acute accent) always be 
collated? If so, wouldn't it be impossible to put those two in that 
order, say, in a text book? (Perhaps Unicode defines a way to stop 
collation.)

> Just like with
> historic character sets in which codes used to represent characters (not
> lower-level thingies as in UCS). Else, one reads, compares, changes
> meaningless bits of text.
>
> As I see it now, we need 2 types:

I think we need more than 2 types...

> * One plain string similar to good old ones (bytestring would do the
> job, since most unicode is utf8 encoded) for the first kind of use
> above. With optional validity check when it's supposed to be unicode text.

Agreed. D gives us three UTF encodings, but I am not sure that there is 
only one abstraction above that.

>> * One higher-level type abstracting from codepoint (not code unit)
> issues, restoring the necessary properties: (1) each character is one
> element in the sequence (2) each character is always represented the
> same way.

I think VLERange should solve only the variable-length-encoding issue. 
It should not get into higher abstractions.

Ali
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-12 14:57:58 -0500, spir <denis.spir@gmail.com> said:

> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
> 
> I'd like to know when it happens that codepoint is the appropriate 
> level of abstraction.

I agree with you. I don't see many uses for code points.

One of these uses is writing a parser for a format defined in terms of 
code points (XML for instance). But beyond that, I don't see one.


> * If pieces of text are not manipulated, meaning just used in the 
> application, or just transferred via the application as is (from file / 
> input / literal to any kind of output), then any kind of encoding just 
> works. One can even concatenate, provided all pieces use the same 
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare, search, count, 
> replace, not to speak about regex/parsing) requires operating at the 
> _higher_ level of characters (in the common sense). Just like with 
> historic character sets in which codes used to represent characters 
> (not lower-level thingies as in UCS). Else, one reads, compares, 
> changes meaningless bits of text.

Very true. In the same way that code points can span multiple code 
units, user-perceived characters (graphemes) can span multiple code 
points.

A funny exercise to make a fool of an algorithm working only with code 
points would be to replace the word "fortune" in a text containing the 
word "fortuné". If the last "é" is expressed as two code points, as "e" 
followed by a combining acute accent (this: é), replacing occurrences 
of "fortune" by "expose" would also replace "fortuné" with "exposé" 
because the combining acute accent remains as the code point following 
the word. Quite amusing, but it doesn't really make sense that it works 
like that.

In the case of "é", we're lucky enough to also have a pre-combined 
character to encode it as a single code point, so encountering "é" 
written as two code points is quite rare. But not all combinations of 
marks and characters can be represented as a single code point. The 
correct thing to do is to treat "é" (single code point) and "é" ("e" + 
combining acute accent) as equivalent.
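That substitution trap is easy to reproduce. A quick sketch, in Python here since any codepoint-level replace behaves the same way:

```python
text = "fortune\u0301"                      # "fortuné" as 'fortune' + combining acute accent
result = text.replace("fortune", "expose")  # plain code-point-level replacement
print(result)                               # "exposé": the orphaned accent lands on the new final 'e'
```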

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin@michelf.com> said:

> A funny exercise to make a fool of an algorithm working only with code 
> points would be to replace the word "fortune" in a text containing the 
> word "fortuné". If the last "é" is expressed as two code points, as "e" 
> followed by a combining acute accent (this: é), replacing occurrences 
> of "fortune" by "expose" would also replace "fortuné" with "exposé" 
> because the combining acute accent remains as the code point following 
> the word. Quite amusing, but it doesn't really make sense that it works 
> like that.
> 
> In the case of "é", we're lucky enough to also have a pre-combined 
> character to encode it as a single code point, so encountering "é" 
> written as two code points is quite rare. But not all combinations of 
> marks and characters can be represented as a single code point. The 
> correct thing to do is to treat "é" (single code point) and "é" ("e" + 
> combining acute accent) as equivalent.

Crap, I meant to send this as UTF-8 with combining characters in it, 
but my news client converted everything to ISO-8859-1.

I'm not sure it'll work, but here's my second attempt at posting real 
combining marks:

	Single code point: é
	e with combining mark: é
	t with combining mark: t̂
	t with two combining marks: t̂̃

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 13, 2011
Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 01:45 AM, Michel Fortin wrote:
> On 2011-01-12 14:57:58 -0500, spir <denis.spir@gmail.com> said:
>
>> On 01/12/2011 08:28 PM, Don wrote:
>>> I think the only problem that we really have, is that "char[]",
>>> "dchar[]" implies that code points is always the appropriate level of
>>> abstraction.
>>
>> I'd like to know when it happens that codepoint is the appropriate
>> level of abstraction.
>
> I agree with you. I don't see many uses for code points.
> 
> One of these uses is writing a parser for a format defined in terms of
> code points (XML for instance). But beyond that, I don't see one.

Actually, I once had a real use case for codepoints being the proper 
level of abstraction: a linguistic app in which one operational func 
counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you 
see what I mean.
Once the text is properly NFD-decomposed, each of those marks is coded 
as a codepoint. (But if it's not decomposed, then most of those marks 
are probably hidden by precomposed codes coding characters like "ä".) So 
even such an app benefits from a higher-level type basically 
operating on normalised (NFD) characters.
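As a side note, that mark-counting can be sketched with Python's stdlib unicodedata module (a toy illustration, not the actual app):

```python
import unicodedata

s = "\u00e4"                               # "ä" as a single precomposed code point
nfd = unicodedata.normalize("NFD", s)      # decomposes to 'a' + U+0308 (combining diaeresis)
marks = [c for c in nfd if unicodedata.combining(c)]
print(len(nfd), len(marks))                # 2 code points, 1 of them a combining mark
```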

>> * If pieces of text are not manipulated, meaning just used in the
>> application, or just transferred via the application as is (from file
>> / input / literal to any kind of output), then any kind of encoding
>> just works. One can even concatenate, provided all pieces use the same
>> encoding. --> _lower_ level than codepoint is OK.
>> * But any kind of manipulation (indexing, slicing, compare, search, count,
>> replace, not to speak about regex/parsing) requires operating at the
>> _higher_ level of characters (in the common sense). Just like with
>> historic character sets in which codes used to represent characters
>> (not lower-level thingies as in UCS). Else, one reads, compares,
>> changes meaningless bits of text.
>
> Very true. In the same way that code points can span multiple code
> units, user-perceived characters (graphemes) can span multiple code
> points.
>
> A funny exercise to make a fool of an algorithm working only with code
> points would be to replace the word "fortune" in a text containing the
> word "fortuné". If the last "é" is expressed as two code points, as "e"
> followed by a combining acute accent (this: é), replacing occurrences of
> "fortune" by "expose" would also replace "fortuné" with "exposé" because
> the combining acute accent remains as the code point following the word.
> Quite amusing, but it doesn't really make sense that it works like that.
>
> In the case of "é", we're lucky enough to also have a pre-combined
> character to encode it as a single code point, so encountering "é"
> written as two code points is quite rare. But not all combinations of
> marks and characters can be represented as a single code point. The
> correct thing to do is to treat "é" (single code point) and "é" ("e" +
> combining acute accent) as equivalent.

You'll find another example in the introduction of the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction

About your last remark, this is precisely one of the two abstractions my 
Text type provides: it groups together into "piles" the codes that belong 
to the same "true" character (grapheme) like "é". So the resulting 
text representation is a sequence of "piles", each representing a 
character. Consequence: indexing, slicing, etc. work sensibly (and even 
other operations are faster, for they do not need to perform that 
"piling" again & again).
In addition to that, the string is first NFD-normalised, thus each 
character has one & only one representation. Consequence: search, 
count, replace, etc., and compare (*) work as expected. In your case:
    // 2 forms of "é"
    assert(Text("\u00E9") == Text("\u0065\u0301"));
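For illustration, here is a toy sketch of that "piling" in Python (this `piles` function is hypothetical, not the actual Text implementation at the link above):

```python
import unicodedata

def piles(s):
    """NFD-normalise, then group each base code point with its trailing combining marks."""
    out = []
    for c in unicodedata.normalize("NFD", s):
        if out and unicodedata.combining(c):
            out[-1] += c      # attach the mark to the previous pile
        else:
            out.append(c)     # start a new pile
    return out

# The two spellings of "é" normalise to the same single pile:
assert piles("\u00E9") == piles("\u0065\u0301") == ["e\u0301"]
```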

Denis

(*) According to UCS coding, not language-specific idiosyncrasies.
More generally, Text abstracts from lower-level issues _introduced_ by 
UCS, Unicode's character set. It does not cope with script-, language-, 
culture-, domain-, or app-specific needs such as custom text sorting 
rules. Some base routines for such operations are provided by Text's 
brother lib DUnicode (access to some code properties, safe concat, 
casefolded compare, NF* normalisation).
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On Thursday 13 January 2011 01:49:31 spir wrote:
> On 01/13/2011 01:45 AM, Michel Fortin wrote:
> > On 2011-01-12 14:57:58 -0500, spir <denis.spir@gmail.com> said:
> >> On 01/12/2011 08:28 PM, Don wrote:
> >>> I think the only problem that we really have, is that "char[]",
> >>> "dchar[]" implies that code points is always the appropriate level of
> >>> abstraction.
> >> 
> >> I'd like to know when it happens that codepoint is the appropriate
> >> level of abstraction.
> > 
> > I agree with you. I don't see many uses for code points.
> > 
> > One of these uses is writing a parser for a format defined in terms of
> > code points (XML for instance). But beyond that, I don't see one.
> 
> Actually, I once had a real use case for codepoints being the proper
> level of abstraction: a linguistic app in which one operational func
> counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
> see what I mean.
> Once the text is properly NFD-decomposed, each of those marks is coded
> as a codepoint. (But if it's not decomposed, then most of those marks
> are probably hidden by precomposed codes coding characters like "ä".) So
> even such an app benefits from a higher-level type basically
> operating on normalised (NFD) characters.

There's also the question of efficiency. On the whole, string operations can be 
very expensive - particularly when you're doing a lot of them. The fact that D's 
arrays are so powerful may reduce the problem in D, but in general, if you're 
doing a lot with strings, it can get costly, performance-wise.

The question then is what is the cost of actually having strings abstracted to 
the point that they really are ranges of characters rather than code units or 
code points or whatever? If the cost is large enough, then dealing with strings 
as arrays as they currently are and having the occasional unicode issue could 
very well be worth it. As it is, there are plenty of people who don't want to 
have to care about unicode in the first place, since the programs that they write 
only deal with ASCII characters. The fact that D makes it so easy to deal with 
unicode code points is a definite improvement, but taking the abstraction to the 
point that you're definitely dealing with characters rather than code units or 
code points could be too costly.

Now, if it can be done efficiently, then having unicode dealt with properly 
without the programmer having to worry about it would be a big boon. As it is, 
D's handling of unicode is a big boon, even if it doesn't deal with graphemes 
and the like.

So, I think that we definitely should have an abstraction for unicode which uses 
characters as the elements in the range and doesn't have to care about the 
underlying encoding of the characters (except perhaps picking whether char, 
wchar, or dchar is used internally, and therefore how much space it requires). 
However, I'm not at all convinced that such an abstraction can be done efficiently 
enough to make it the default way of handling strings.

- Jonathan M Davis
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/13/2011 01:51 AM, Michel Fortin wrote:
> On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin@michelf.com>
> said:
>
>> A funny exercise to make a fool of an algorithm working only with code
>> points would be to replace the word "fortune" in a text containing the
>> word "fortuné". If the last "é" is expressed as two code points, as
>> "e" followed by a combining acute accent (this: é), replacing
>> occurrences of "fortune" by "expose" would also replace "fortuné" with
>> "exposé" because the combining acute accent remains as the code point
>> following the word. Quite amusing, but it doesn't really make sense
>> that it works like that.
>>
>> In the case of "é", we're lucky enough to also have a pre-combined
>> character to encode it as a single code point, so encountering "é"
>> written as two code points is quite rare. But not all combinations of
>> marks and characters can be represented as a single code point. The
>> correct thing to do is to treat "é" (single code point) and "é" ("e" +
>> combining acute accent) as equivalent.
>
> Crap, I meant to send this as UTF-8 with combining characters in it, but
> my news client converted everything to ISO-8859-1.
>
> I'm not sure it'll work, but here's my second attempt at posting real
> combining marks:
>
> Single code point: é
> e with combining mark: é
> t with combining mark: t̂
> t with two combining marks: t̂̃

Works :-) But your first post displayed fine for me as well: for instance 
<<"é" ("e" + combining acute accent)>> was displayed as "é", a single 
accented letter. I guess maybe your email client did not convert to 
iso-8859-1 on sending, but on reading (mine is set to utf-8).

Denis
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
> On Thursday 13 January 2011 01:49:31 spir wrote:
>> On 01/13/2011 01:45 AM, Michel Fortin wrote:
>>> On 2011-01-12 14:57:58 -0500, spir<denis.spir@gmail.com>  said:
>>>> On 01/12/2011 08:28 PM, Don wrote:
>>>>> I think the only problem that we really have, is that "char[]",
>>>>> "dchar[]" implies that code points is always the appropriate level of
>>>>> abstraction.
>>>>
>>>> I'd like to know when it happens that codepoint is the appropriate
>>>> level of abstraction.
>>>
>>> I agree with you. I don't see many uses for code points.
>>>
>>> One of these uses is writing a parser for a format defined in terms of
>>> code points (XML for instance). But beyond that, I don't see one.
>>
>> Actually, I had once a real use case for codepoint beeing the proper
>> level of abstraction: a linguistic app of which one operational func
>> counts occurrences of "scripting marks" like 'a'&  '¨' in "ä". hope you
>> see what I mean.
>> Once the text is properly NFD decomposed, each of those marks in coded
>> as a codepoint. (But if it's not decomposed, then most of those marks
>> are probably hidden by precomposed codes coding characters like "ä".) So
>> that even such an app benefits from a higher-level type basically
>> operating on normalised (NFD) characters.
>
> There's also the question of efficiency. On the whole, string operations can be
> very expensive - particularly when you're doing a lot of them. The fact that D's
> arrays are so powerful may reduce the problem in D, but in general, if you're
> doing a lot with strings, it can get costly, performance-wise.

D's arrays (even dchar[] & dstring) do not allow having correct results 
when dealing with UCS/Unicode text in the general case. See Michel's 
example (and several ones I posted on this list, and the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction 
for a very lengthy explanation).
You and some other people still seem to confuse Unicode's low-level 
issue of codepoint vs code unit with the higher-level issue of codes 
_not_ representing characters in the common sense ("graphemes").

The text pointed above was written precisely to introduce this issue 
because obviously no-one wants to face it... (Eg each time I evoke it on 
this list it is ignored, except by Michel, but the same is true 
everywhere else, including on the Unicode mailing list!). The core of 
the problem is the misleading term "abstract character", which 
deceivingly lets programmers believe that a codepoint codes a 
character, like in historic character sets -- which is *wrong*. No 
Unicode document AFAIK explains this. This is a kind of lie by omission.
Compared to legacy charsets, dealing with Unicode actually requires *2* 
levels of abstraction... (one to decode codepoints from code units, one 
to construct characters from codepoints)
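To make the two levels concrete, a small sketch (Python, toy code only): the first step decodes code units into codepoints, the second groups codepoints into characters:

```python
import unicodedata

raw = b"expose\xcc\x81"        # UTF-8 code units for "exposé" ('e' + combining acute)

# Level 1: code units -> code points
cps = raw.decode("utf-8")
print(len(cps))                # 7 code points

# Level 2: code points -> characters, attaching combining marks to their base
chars = []
for c in cps:
    if chars and unicodedata.combining(c):
        chars[-1] += c
    else:
        chars.append(c)
print(len(chars))              # 6 user-perceived characters
```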

Note that D's stdlib currently provides no means to do this, not even on 
the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode 
library) (good luck ;-). But even ICU, as well as supposed unicode-aware 
typse or librarys for any language, would give you an abstraction 
producing correct results for Michel's example. For instance, Python3 
code fails as miserably as any other. AFAIK, D is the first and only 
language having such a tool (Text.d at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

> The question then is what is the cost of actually having strings abstracted to
> the point that they really are ranges of characters rather than code units or
> code points or whatever? If the cost is large enough, then dealing with strings
> as arrays as they currently are and having the occasional unicode issue could
> very well be worth it. As it is, there are plenty of people who don't want to
> have to care about unicode in the first place, since the programs that they write
> only deal with ASCII characters. The fact that D makes it so easy to deal with
> unicode code points is a definite improvement, but taking the abstraction to the
> point that you're definitely dealing with characters rather than code units or
> code points could be too costly.

When _manipulating_ text (indexing, search, changing), you have the 
choice between:
* On the fly abstraction (composing characters on the fly, and/or 
normalising them), for each operation for each piece of text (including 
parameters, including literals).
* Use of a type that constructs this abstraction once only for each 
piece of text.
Note that a single count operation is forced to construct this 
abstraction on the fly for the whole text... (and for the searched snippet).
Also note that optimisation is probably easier in the second case, for 
the abstraction operation is then standard.

> Now, if it can be done efficiently, then having unicode dealt with properly
> without the programmer having to worry about it would be a big boon. As it is,
> D's handling of unicode is a big boon, even if it doesn't deal with graphemes
> and the like.

It has a cost at initial Text construction time. Currently, on my very 
slow computer, 1MB of source text requires ~ 500 ms (decoding + 
decomposition + ordering + "piling" codes into characters). Decoding 
alone, using D's builtin std.utf.decode, takes about 100 ms.
The bottleneck is piling: 70% of the time on average, on a test case 
mixing texts from a dozen natural languages. We would be very glad to 
get the community's help in optimising this phase :-)
(We have progressed very much already in terms of speed, but now reach 
limits of our competences.)

> So, I think that we definitely should have an abstraction for unicode which uses
> characters as the elements in the range and doesn't have to care about the
> underlying encoding of the characters (except perhaps picking whether char,
> wchar, or dchar is used internally, and therefore how much space it requires).
> However, I'm not at all convinced that such an abstraction can be done efficiently
> enough to make it the default way of handling strings.

If you only have ASCII, or if you don't manipulate text at all, then as 
said in a previous post any string representation works fine (whatever 
the encoding it possibly uses under the hood).
D's builtin char/dchar/wchar and string/dstring/wstring are very nice 
and well done, but they are not necessary in such a use case. Actually, 
as shown by Steven's repeated complaints, they rather get in the way when 
dealing with non-unicode source data (IIUC, by assuming string elements 
are utf codes).

And they do not even try to solve the real issues one necessarily meets 
when manipulating unicode texts, which are due to UCS's coding format. 
Thus my previous statement: the level of codepoints is nearly never the 
proper level of abstraction.

> - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On Thursday 13 January 2011 03:48:46 spir wrote:
> On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
> > On Thursday 13 January 2011 01:49:31 spir wrote:
> >> On 01/13/2011 01:45 AM, Michel Fortin wrote:
> >>> On 2011-01-12 14:57:58 -0500, spir<denis.spir@gmail.com>  said:
> >>>> On 01/12/2011 08:28 PM, Don wrote:
> >>>>> I think the only problem that we really have, is that "char[]",
> >>>>> "dchar[]" implies that code points is always the appropriate level of
> >>>>> abstraction.
> >>>> 
> >>>> I'd like to know when it happens that codepoint is the appropriate
> >>>> level of abstraction.
> >>> 
> >>> I agree with you. I don't see many uses for code points.
> >>> 
> >>> One of these uses is writing a parser for a format defined in terms of
> >>> code points (XML for instance). But beyond that, I don't see one.
> >> 
> >> Actually, I once had a real use case for codepoints being the proper
> >> level of abstraction: a linguistic app in which one operational func
> >> counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
> >> see what I mean.
> >> Once the text is properly NFD-decomposed, each of those marks is coded
> >> as a codepoint. (But if it's not decomposed, then most of those marks
> >> are probably hidden by precomposed codes coding characters like "ä".) So
> >> even such an app benefits from a higher-level type basically
> >> operating on normalised (NFD) characters.
> > 
> > There's also the question of efficiency. On the whole, string operations
> > can be very expensive - particularly when you're doing a lot of them.
> > The fact that D's arrays are so powerful may reduce the problem in D,
> > but in general, if you're doing a lot with strings, it can get costly,
> > performance-wise.
> 
> D's arrays (even dchar[] & dstring) do not allow having correct results
> when dealing with UCS/Unicode text in the general case. See Michel's
> example (and several ones I posted on this list, and the text at
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20le
> vel%20of%20abstraction for a very lengthy explanation).
> You and some other people still seem to confuse Unicode's low-level
> issue of codepoint vs code unit with the higher-level issue of codes
> _not_ representing characters in the common sense ("graphemes").
> 
> The text pointed above was written precisely to introduce this issue
> because obviously no-one wants to face it... (Eg each time I evoke it on
> this list it is ignored, except by Michel, but the same is true
> everywhere else, including on the Unicode mailing list!). The core of
> the problem is the misleading term "abstract character", which
> deceivingly lets programmers believe that a codepoint codes a
> character, like in historic character sets -- which is *wrong*. No
> Unicode document AFAIK explains this. This is a kind of lie by omission.
> Compared to legacy charsets, dealing with Unicode actually requires *2*
> levels of abstraction... (one to decode codepoints from code units, one
> to construct characters from codepoints)
> 
> Note that D's stdlib currently provides no means to do this, not even on
> the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
> library) (good luck ;-). But even ICU, as well as supposed unicode-aware
> typse or librarys for any language, would give you an abstraction
> producing correct results for Michel's example. For instance, Python3
> code fails as miserably as any other. AFAIK, D is the first and only
> language having such a tool (Text.d at
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
> 
> > The question then is what is the cost of actually having strings
> > abstracted to the point that they really are ranges of characters rather
> > than code units or code points or whatever? If the cost is large enough,
> > then dealing with strings as arrays as they currently are and having the
> > occasional unicode issue could very well be worth it. As it is, there
> > are plenty of people who don't want to have to care about unicode in the
> > first place, since the programs that they write only deal with ASCII
> > characters. The fact that D makes it so easy to deal with unicode code
> > points is a definite improvement, but taking the abstraction to the
> > point that you're definitely dealing with characters rather than code
> > units or code points could be too costly.
> 
> When _manipulating_ text (indexing, search, changing), you have the
> choice between:
> * On the fly abstraction (composing characters on the fly, and/or
> normalising them), for each operation for each piece of text (including
> parameters, including literals).
> * Use of a type that constructs this abstraction once only for each
> piece of text.
> Note that a single count operation is forced to construct this
> abstraction on the fly for the whole text... (and for the searched
> snippet). Also note that optimisation is probably easier in the second
> case, for the abstraction operation is then standard.
> 
> > Now, if it can be done efficiently, then having unicode dealt with
> > properly without the programmer having to worry about it would be a big
> > boon. As it is, D's handling of unicode is a big boon, even if it
> > doesn't deal with graphemes and the like.
> 
> It has a cost at initial Text construction time. Currently, on my very
> slow computer, 1MB of source text requires ~ 500 ms (decoding +
> decomposition + ordering + "piling" codes into characters). Decoding
> alone, using D's builtin std.utf.decode, takes about 100 ms.
> The bottleneck is piling: 70% of the time on average, on a test case
> mixing texts from a dozen natural languages. We would be very glad to
> get the community's help in optimising this phase :-)
> (We have progressed very much already in terms of speed, but now reach
> limits of our competences.)
> 
> > So, I think that we definitely should have an abstraction for unicode
> > which uses characters as the elements in the range and doesn't have to
> > care about the underlying encoding of the characters (except perhaps
> > picking whether char, wchar, or dchar is used internally, and therefore
> > how much space it requires). However, I'm not at all convinced that such
> > an abstraction can be done efficiently enough to make it the default way
> > of handling strings.
> 
> If you only have ASCII, or if you don't manipulate text at all, then as
> said in a previous post any string representation works fine (whatever
> the encoding it possibly uses under the hood).
> D's builtin char/dchar/wchar and string/dstring/wstring are very nice
> and well done, but they are not necessary in such a use case. Actually,
> as shown by Steven's repeated complaints, they rather get in the way when
> dealing with non-unicode source data (IIUC, by assuming string elements
> are utf codes).
> 
> And they do not even try to solve the real issues one necessarily meets
> when manipulating unicode texts, which are due to UCS's coding format.
> Thus my previous statement: the level of codepoints is nearly never the
> proper level of abstraction.

I wasn't saying that code points are guaranteed to be characters. I was saying 
that in most cases they are, so if efficiency is an issue, then having properly 
abstract characters could be too costly. However, having a range type which 
properly abstracts characters and deals with whatever graphemes and 
normalization and whatnot that it has to would be a very good thing to have. The 
real question is whether it can be made efficient enough to even consider using it 
normally instead of just when you know that you're really going to need it.

The fact that you're seeing such a large drop in performance with your Text type 
definitely would support the idea that it could be just plain too expensive to 
use such a type in the average case. Even something like a 20% drop in 
performance could be devastating if you're dealing with code which does a lot of 
string processing. Regardless though, there will obviously be cases where you'll 
need something like your Text type if you want to process unicode correctly.

However, regardless of what the best way to handle unicode is in general, I 
think that it's painfully clear that your average programmer doesn't know much 
about unicode. Even understanding the nuances between char, wchar, and dchar is 
more than your average programmer seems to understand at first. The idea that a 
char wouldn't be guaranteed to be an actual character is not something that many 
programmers take to immediately. It's quite foreign to how chars are typically 
dealt with in other languages, and many programmers never worry about unicode at 
all, only dealing with ASCII. So, not only is unicode a rather disgusting 
problem, but it's not one that your average programmer begins to grasp as far as 
I've seen. Unless the issue is abstracted away completely, it takes a fair bit 
of explaining to understand how to deal with unicode properly.

- Jonathan M Davis
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 01:10 PM, Jonathan M Davis wrote:
> I wasn't saying that code points are guaranteed to be characters. I was saying
> that in most cases they are, so if efficiency is an issue, then having properly
> abstract characters could be too costly.

The problem is then: how does a library or application programmer know, 
for sure, that all true characters (graphemes) from all source texts its 
software will ever deal with are coded with a single codepoint?
If you cope with ASCII only now & forever, then you know that.
If you do not manipulate text at all, then the question vanishes.
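[Editor's note: a minimal, admittedly incomplete check for this question could look like the Python sketch below. It only detects combining marks; note that combining marks are not the only way a grapheme can span several code points, so a negative result is no general guarantee.]

```python
import unicodedata

def has_combining_marks(text):
    """True if any code point in `text` is a combining mark, i.e. at
    least one grapheme in it spans more than one code point."""
    return any(unicodedata.combining(cp) for cp in text)

print(has_combining_marks("hello"))       # False: plain ASCII is safe
print(has_combining_marks("cafe\u0301"))  # True: 'é' spelled as e + U+0301
```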

Else, you cannot know, I guess. The problem is partially masked because 
most of us currently process only western-language sources, whose 
scripts have precomposed codes for every _predefined_ character, and 
text-producing software (like editors) usually uses precomposed codes 
when available. Hope I'm clear.
(I hope this use of precomposed codes will change, because the gain in 
space for western languages is negligible while the extra processing 
cost is significant.)
In the future, all of this may change, so that the issue would more 
often be obvious for many programmers dealing with international text. 
Note that even now nothing prevents a user (including a programmer in 
source code!), let alone text-producing software, from using decomposed 
coding (the right choice imo). And there are true characters, and you 
can "invent" as many fancy characters as you like, for which no 
precomposed code is defined. All of this is valid unicode and must be 
properly dealt with.
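[Editor's note: both points, the two codings and the "invented" characters with no precomposed code, can be demonstrated with Python's unicodedata module (illustration only).]

```python
import unicodedata

s = "e\u0301"                          # decomposed 'é': e + combining acute
nfc = unicodedata.normalize("NFC", s)  # folds to precomposed U+00E9
print(nfc == "\u00E9")                 # True

# An "invented" character: 'x' with a combining ring above (U+030A).
# No precomposed code point exists for it, so NFC must leave it as
# two code points -- yet it is perfectly valid Unicode.
fancy = "x\u030A"
print(len(unicodedata.normalize("NFC", fancy)))  # still 2
```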

> However, having a range type which
> properly abstracts characters and deals with whatever graphemes and
> normalization and whatnot that it has to would be a very good thing 
> to have. The real question is whether it can be made efficient enough to
> even consider using it normally instead of just when you know that
> you're really going to need it.

Regarding ranges: we initially planned to expose a range interface in 
our type for iteration, instead of opApply, for better integration with 
the coming D2 style and algorithms. But we had to drop it due to a few 
range bugs exposed in a previous thread (search for "range usability" IIRC).

> The fact that you're seeing such a large drop in performance with your Text type
> definitely would support the idea that it could be just plain too expensive to
> use such a type in the average case. Even something like a 20% drop in
> performance could be devastating if you're dealing with code which does a lot of
> string processing. Regardless though, there will obviously be cases where you'll
> need something like your Text type if you want to process unicode correctly.

The question of efficiency is not as you present it. If you cannot 
guarantee that every character is coded by a single code (in all pieces 
of text, including params and literals), then you *must* construct an 
abstraction at the level of true characters --and even probably 
normalise them.
You have the choice of doing it on the fly for _every_ operation, or 
using a tool like the type Text. In the latter case, not only everything 
is far simpler for client code, but the abstraction is constructed only 
once (and forever ;-).

In the first case, the cost is the same (or rather higher because 
optimisation can probably be more efficient for a single standard case 
than for various operation cases); but _multiplied_ by the number of 
operations you need to perform on each piece of text. Thus, for a given 
operation, you get the slowest possible run: for instance indexing is 
O(k*n) where k is the cost of "piling" a single char, and n the char 
count...

In the second case, the efficiency issue happens only initially for each 
piece of text. Then, every operation is as fast as possible: indexing is 
indeed O(1).
But: this O(1) is slightly slower than with historic charsets because 
characters are now represented by mini code arrays instead of single 
codes. The same point applies even more for every operation involving 
compares (search, count, replace). We cannot solve this: it is due to 
UCS's coding scheme.
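[Editor's note: the trade-off argued above can be modelled in a few lines of Python. This is a toy sketch, not the actual Text.d implementation; the class name merely mirrors the thread's type. The piling cost is paid once at construction, after which indexing is a plain O(1) array access into clusters.]

```python
import unicodedata

class Text:
    """Toy model: pile code points into clusters once at construction,
    then index true characters in O(1)."""
    def __init__(self, s):
        # Construction cost: normalise, then one pass grouping each
        # combining mark with its base character.
        self.clusters = []
        for cp in unicodedata.normalize("NFD", s):
            if unicodedata.combining(cp) and self.clusters:
                self.clusters[-1] += cp
            else:
                self.clusters.append(cp)

    def __getitem__(self, i):  # O(1), unlike re-decoding on the fly
        return self.clusters[i]

    def __len__(self):
        return len(self.clusters)

t = Text("cafe\u0301")    # 'é' decomposed: 5 code points
print(len(t))             # 4 true characters
print(t[3] == "e\u0301")  # True: one element, a mini array of codes
```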

> However, regardless of what the best way to handle unicode is in general, I
> think that it's painfully clear that your average programmer doesn't know much
> about unicode.

True. Even those who think they are informed. Because Unicode's docs 
not only ignore the problem, but contribute to creating it by using the 
misleading term "abstract character" (and often worse, "character" 
alone) to denote what a codepoint codes. All articles I have ever read 
_about_ Unicode by third parties simply follow suit. Raising this issue 
on the unicode mailing list usually results in plain silence.

> Even understanding the nuances between char, wchar, and dchar is
> more than your average programmer seems to understand at first. The idea that a
> char wouldn't be guaranteed to be an actual character is not something that many
> programmers take to immediately. It's quite foreign to how chars are typically
> dealt with in other languages, and many programmers never worry about unicode at
> all, only dealing with ASCII.

(average programmer ? ;-)
Not so much to "how chars are typically dealt with in other 
languages", rather to how characters were coded in historic charsets. 
Other languages ignore the issue, and thus run incorrectly on 
universal text, the same way as D's builtin tools do.
About ASCII, note that the only kind of source it is able to encode is 
plain english text, without any fancy thing in it. A single 
non-breaking space, a "≥" or "×" (multiplication sign, U+00D7), a 
letter borrowed from a foreign language as in "à la", likewise "αβγ", 
not to mention "©" & "®" -- any of these is enough to fall outside 
ASCII.

> So, not only is unicode a rather disgusting
> problem, but it's not one that your average programmer begins to grasp as far as
> I've seen. Unless the issue is abstracted away completely, it takes a fair bit
> of explaining to understand how to deal with unicode properly.

Please have a look at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction, 
and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d
Any feedback welcome (esp on reformulating the text concisely ;-)

> - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com