October 28, 2011
Walter Bright, in message (digitalmars.D:147161), wrote:
> On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
>> It's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.
> 
> Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.

I definitely agree with you, but I have a piece of news for you: the whole of Phobos already treats strings as dchar ranges, and IS inefficient for processing strings.

The fact is: char[] is not char[] in Phobos. It is Range!dchar. This is awful and schizophrenic, and also inefficient. The purpose was to allow people to use strings without knowing anything about Unicode. That is why Jonathan proposed having a specific string structure for manipulating strings without having to worry about Unicode, just as strings are currently manipulated in Phobos, while letting char[] be char[], and be as efficient as it should be.

I was not there when it was decided to treat strings as dchar ranges, but now it is done. The only thing I can do is use ubyte[] instead of char[] so that Phobos treats them as proper arrays, and propose optimized overloads for various Phobos algorithms to make them as efficient as they should be (which I haven't found the time to do yet).
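
For illustration, a minimal sketch of the difference (using std.range's introspection; walkLength counts range elements by decoding):

import std.range;

void main()
{
    string s = "fiancé";
    // As a range, a string is decoded element by element into dchar,
    // and is not even considered to have a length.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    assert(s.walkLength == 6);   // 6 code points, found by decoding

    // The same bytes viewed as ubyte[] form a plain random-access array.
    auto bytes = cast(immutable(ubyte)[]) s;
    static assert(hasLength!(typeof(bytes)));
    assert(bytes.length == 7);   // 7 UTF-8 code units ('é' takes two)
}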

-- 
Christophe
October 28, 2011
Dmitry Olshansky, in message (digitalmars.D:147415), wrote:
> Assuming language support stays at the stage of "a code point is a character", it's totally expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks hitting certain issues.

Strings being seen as ranges of code points (dchar) is already awful enough. Seeing strings as ranges of displayable characters just does not make sense. Unicode is too complicated to allow doing this for general-purpose string manipulation. All the transformations to displayable characters can only be done when displaying them!

Just as fiancé is hidden if you write fiance' (with the appropriate Unicode combining character so that the ' is placed over the 'e'), you can hide any word by using such combining characters. You have to make assumptions about the input, and you have to put limitations on the algorithm, because in any case you can get unexpected behavior. And I can assure you there is less unexpected behavior if you treat strings as dchar ranges, or even as char[], than if you treat them as displayable characters.
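
To make that concrete, a small sketch (the escape sequences below spell the same displayed word in two valid encodings):

import std.algorithm : equal;

void main()
{
    string composed   = "fianc\u00E9";   // 'é' as one precomposed code point
    string decomposed = "fiance\u0301";  // 'e' followed by a combining acute
    // Rendered identically, yet unequal as dchar ranges (6 vs 7 code points).
    assert(!equal(composed, decomposed));
}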

> It's a complete mess even with proper decoding ;)

Sure; that's why we had better not decode.
October 28, 2011
On 10/28/11 4:41 AM, Christophe Travert wrote:
> I definitely agree with you, but I have a piece of news for you:
> the whole of Phobos already treats strings as dchar ranges, and IS
> inefficient for processing strings.
>
> The fact is: char[] is not char[] in Phobos. It is Range!dchar. This is
> awful and schizophrenic, and also inefficient. The purpose was to allow
> people to use strings without knowing anything about Unicode.

It was to allow people who know Unicode to use strings without too much arcana.

> That is
> why Jonathan proposed having a specific string structure for manipulating
> strings without having to worry about Unicode, just as strings are
> currently manipulated in Phobos, while letting char[] be char[], and be as
> efficient as it should be.

No need. Use http://www.digitalmars.com/d/2.0/phobos/std_string.html#representation
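
For example, a quick sketch (the exact return type of representation has varied across releases; recent Phobos returns immutable(ubyte)[] for a string, with no auto-decoding):

import std.string : representation;

void main()
{
    string s = "fiancé";
    auto bytes = s.representation;   // raw UTF-8 code units
    assert(bytes.length == 7);       // 7 code units for 6 code points
}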

> I was not there when it was decided to treat strings as dchar ranges,
> but now it is done. The only thing I can do is use ubyte[] instead of
> char[] so that Phobos treats them as proper arrays, and propose optimized
> overloads for various Phobos algorithms to make them as efficient as they
> should be (which I haven't found the time to do yet).

What would you have done if you were there? What is your design of choice?


Andrei
October 29, 2011
On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
> <simen.kjaras@gmail.com> wrote:
>
>> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
>> <schveiguy@yahoo.com> wrote:
>>
>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>> character, but may be a separate code point.
>>
>> If this is correct (and it is), then decoding to dchar is simply not
>> enough.
>> You seem to advocate decoding to graphemes, which is a whole different
>> matter.
>
> I am advocating that. And it's a matter of perception. D can say "we
> only support code-point decoding" and what that means to a user is, "we
> don't support language as you know it." Sure it's a part of Unicode, but
> it takes that extra piece to make it actually usable to people who
> require Unicode.
>
> Even in English, fiancé has an accent. To say D supports Unicode, but
> then won't do a simple search on a file which contains a certain *valid*
> encoding of that word is disingenuous to say the least.

Why doesn't that simple search work?

foreach (line; stdin.byLine()) {   // assuming: import std.stdio, std.algorithm;
    if (line.canFind("fiancé")) {
       writeln("There it is.");
    }
}

> D needs a fully Unicode-aware string type. I advocate D should use it as
> the default string type, but it needs one whether it's the default or
> not in order to say it supports Unicode.

How do you define "supports Unicode"? For my money, the main sin of (w)string is that it offers [] and .length with potentially confusing semantics, so if I could I'd curb, not expand, its interface.
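
To spell out the confusion in question, a minimal sketch:

import std.range;

void main()
{
    string s = "é";          // one character, two UTF-8 code units
    assert(s.length == 2);   // .length counts code units, not characters
    assert(s[0] == 0xC3);    // s[0] is the first byte of the encoding
    assert(s.front == 'é');  // yet the range primitives decode to dchar
}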


Andrei
October 29, 2011
On Saturday, October 29, 2011 09:42:54 Andrei Alexandrescu wrote:
> On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
> > On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
> > 
> > <simen.kjaras@gmail.com> wrote:
> >> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
> >> 
> >> <schveiguy@yahoo.com> wrote:
> >>> Plus, a combining character (such as an umlaut or accent) is part of
> >>> a
> >>> character, but may be a separate code point.
> >> 
> >> If this is correct (and it is), then decoding to dchar is simply not
> >> enough.
> >> You seem to advocate decoding to graphemes, which is a whole different
> >> matter.
> > 
> > I am advocating that. And it's a matter of perception. D can say "we only support code-point decoding" and what that means to a user is, "we don't support language as you know it." Sure it's a part of Unicode, but it takes that extra piece to make it actually usable to people who require Unicode.
> > 
> > Even in English, fiancé has an accent. To say D supports Unicode, but then won't do a simple search on a file which contains a certain *valid* encoding of that word is disingenuous to say the least.
> 
> Why doesn't that simple search work?
> 
> foreach (line; stdin.byLine()) {
>      if (line.canFind("fiancé")) {
>         writeln("There it is.");
>      }
> }

If the strings aren't normalized the same way, then it might not find fiancé. If they _are_ normalized the same way and fiancé is in there except that the é is actually modified by another code point after it (e.g. a subscript of 2 - not exactly likely in this case but certainly possible), then that string would be found when it shouldn't be. The bigger problem though, I think, is when you're searching for a string which is the same without the modifiers - which would be fiance in this case - since then, if the modifying code points come after, find will think that it found the string that you were looking for when it didn't.

Once you're dealing with modifying code points, in the general case, you _must_ operate on the grapheme level to ensure that you find exactly what you're looking for and only what you're looking for. If we assume that all strings are normalized the same way and pick the right normalization for it (and provide a function to normalize strings that way of course), then we could probably make that work 100% of the time (assuming that there's a normalized form with all of the modifying code points being _before_ the code point that we modify and that no modifying code point can be a character on its own), but I'd have to study up on it more to be sure.

Regardless, while searching for fiancé has a decent chance of success (especially if programs generally favor using single code points instead of multiple code points wherever possible), it's still a risky proposition without at least doing Unicode normalization, if not outright using a range of graphemes rather than code points.
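
For what it's worth, here is a sketch of both the failure and the normalization fix. It assumes a normalize function along the lines of std.uni.normalize, which is in Phobos today but was not at the time of this thread:

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    string line = "her fiance\u0301 arrived";   // decomposed: 'e' + combining acute
    // A code-point search for the precomposed spelling misses it:
    assert(!line.canFind("fianc\u00E9"));
    // Normalizing both sides to NFC first makes the search reliable:
    assert(normalize!NFC(line).canFind(normalize!NFC("fianc\u00E9")));
}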

- Jonathan M Davis
October 31, 2011
On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
>> On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
>> <simen.kjaras@gmail.com> wrote:
>>
>>> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
>>> <schveiguy@yahoo.com> wrote:
>>>
>>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>>> character, but may be a separate code point.
>>>
>>> If this is correct (and it is), then decoding to dchar is simply not
>>> enough.
>>> You seem to advocate decoding to graphemes, which is a whole different
>>> matter.
>>
>> I am advocating that. And it's a matter of perception. D can say "we
>> only support code-point decoding" and what that means to a user is, "we
>> don't support language as you know it." Sure it's a part of Unicode, but
>> it takes that extra piece to make it actually usable to people who
>> require Unicode.
>>
>> Even in English, fiancé has an accent. To say D supports Unicode, but
>> then won't do a simple search on a file which contains a certain *valid*
>> encoding of that word is disingenuous to say the least.
>
> Why doesn't that simple search work?
>
> foreach (line; stdin.byLine()) {
>      if (line.canFind("fiancé")) {
>         writeln("There it is.");
>      }
> }

I think Jonathan answered that quite well, nothing else to add...

>
>> D needs a fully Unicode-aware string type. I advocate D should use it as
>> the default string type, but it needs one whether it's the default or
>> not in order to say it supports Unicode.
>
> How do you define "supports Unicode"? For my money, the main sin of (w)string is that it offers [] and .length with potentially confusing semantics, so if I could I'd curb, not expand, its interface.

LOL, I'm so used to programming that I was trying to figure out what sin(string) (as in sine) means :)

I think there are two problems with [] and .length.  First, that they imply "get nth character" and "number of characters" respectively, and second, that many times they *actually are* those things.

So I agree with you: the proposed string type needs to curb that interface, while giving us a fully character/grapheme-aware interface (which is currently lacking).

I made an early attempt at doing this, and I will eventually get around to finishing it.  I was in the middle of creating an algorithm to delineate a grapheme as efficiently as possible when I got sidetracked by other things :)  There are still lingering issues with the language which make this a less-than-ideal replacement (arrays currently enjoy a lot of "extra features" that custom types do not).
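
A sketch of what such a grapheme-aware interface can look like, assuming something like std.uni.byGrapheme (which landed in Phobos well after this thread):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "fiance\u0301";             // 'e' + combining acute: 7 code points
    assert(s.walkLength == 7);             // as a dchar range: 7 elements
    assert(s.byGrapheme.walkLength == 6);  // as graphemes: 6 visible characters
}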

-Steve
October 31, 2011
On 10/31/11 8:20 AM, Steven Schveighoffer wrote:
>> Why doesn't that simple search work?
>>
>> foreach (line; stdin.byLine()) {
>>     if (line.canFind("fiancé")) {
>>         writeln("There it is.");
>>     }
>> }
>
> I think Jonathan answered that quite well, nothing else to add...

I see this as a decoding primitive issue that can be resolved.

Andrei