October 24, 2011
On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2@digitalmars.com> wrote:

> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>> Which operations do you believe would be less efficient?
>
> All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect.  That is, if you want to find a substring in a string, you cannot just compare chars.

But if you want to do the fundamentally incorrect thing (perhaps because you as the programmer know the limits of what the data may contain), then you should be able to do that for efficiency.  But to default to something incorrect and then claim it's for efficiency is almost laughable.

If I gave you an O(lg(n)) shortest path algorithm that did not always find the shortest path, would you agree it was an improvement over Dijkstra's?

-Steve
October 24, 2011
On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>> Which operations do you believe would be less efficient?
>>
>> All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
>
> Searching that does not do decoding is fundamentally incorrect.  That is, if you want to find a substring in a string, you cannot just compare chars.

Assuming both strings are valid UTF-8, you can. Continuation bytes can never
be confused with the first byte of a code point, and the first byte always
identifies how many continuation bytes there should be.
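For what it's worth, a tiny sketch of that in D (the helper name and the brute-force loop are mine, not anything from Phobos): as long as the needle itself is valid UTF-8, a plain code-unit comparison can only ever match on a code point boundary, because a continuation byte (10xxxxxx) never compares equal to a lead byte.

// Hypothetical helper: byte-for-byte substring search over UTF-8 code units.
size_t indexOfCodeUnits(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length == 0)
        return 0;
    if (haystack.length < needle.length)
        return size_t.max;                     // "not found" sentinel
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return i;
    return size_t.max;
}

unittest
{
    // "é" is encoded as 0xC3 0xA9; the ASCII byte for 'e' (0x65) never
    // appears inside it, so searching for "e" cannot hit the middle of "é".
    assert(indexOfCodeUnits("caf\u00E9", "e") == size_t.max);
    assert(indexOfCodeUnits("caf\u00E9", "\u00E9") == 3);
}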

-- 
  Simen
October 24, 2011
On 21.10.2011 06:06, Jonathan M Davis wrote:
> It's this very problem that leads some people to argue that string should be
> its own type which holds an array of code units (which can be accessed when
> needed) rather than doing what we do now where we try and treat a string as
> both an array of chars and a range of dchars. The result is schizophrenic.

Indeed - expressing strings as arrays of characters will always fall short of the unicode concept in some way. A truly unicode-compliant language has to handle strings as opaque objects that do not have any encoding. There are a number of operations that can be done with these objects (concatenation, comparison, searching, etc.). Any kind of defined memory representation can only be obtained by an explicit encoding operation.

Python 3, for example, took a fundamental step by introducing exactly this distinction. At first it seems silly, having to think about encodings so often when writing trivial code. After a short while, though, the strict conceptual separation between unencoded "strings" and encoded "arrays of something" really helps avoid ugly problems.

Sure, for a performance-critical language, the issue becomes a lot trickier. I still think it is possible, and ultimately the only way to solve the tricky problems that will otherwise always crop up somewhere.
October 24, 2011
On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
<simen.kjaras@gmail.com> wrote:

> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>
>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2@digitalmars.com> wrote:
>>
>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>> Which operations do you believe would be less efficient?
>>>
>>> All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
>>
>> Searching that does not do decoding is fundamentally incorrect.  That is, if you want to find a substring in a string, you cannot just compare chars.
>
> Assuming both strings are valid UTF-8, you can. Continuation bytes can never
> be confused with the first byte of a code point, and the first byte always
> identifies how many continuation bytes there should be.
>

As others have pointed out in the past to me (and I thought as you did
once), the same characters can be encoded in *different ways*.  They must
be normalized to accurately compare.

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point.  If that's on the last
character in the word such as fiancé, then searching for fiance will
result in a match without proper decoding!  Or if fiancé uses a
precomposed é, it won't match.  So two valid representations of the word
either match or they don't.  It's just a complete mess without proper
unicode decoding.
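To make it concrete, a quick D sketch (plain std.algorithm.canFind, no normalization or grapheme handling anywhere; the two literals spell out the two valid encodings):

import std.algorithm : canFind;
import std.stdio : writeln;

void main()
{
    // Two valid spellings of the same word:
    string precomposed = "fianc\u00E9";   // é as a single code point (U+00E9)
    string decomposed  = "fiance\u0301";  // 'e' followed by combining acute (U+0301)

    // Naive search with no normalization or grapheme awareness:
    writeln(precomposed.canFind("fiance")); // false: é never compares equal to 'e'
    writeln(decomposed.canFind("fiance"));  // true: matches, silently ignoring the
                                            // combining accent that follows
}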

-Steve
October 24, 2011
On 24.10.2011 23:41, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
> <simen.kjaras@gmail.com> wrote:
>
>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>> <schveiguy@yahoo.com> wrote:
>>
>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>> <newshound2@digitalmars.com> wrote:
>>>
>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>> Which operations do you believe would be less efficient?
>>>>
>>>> All of the ones that don't require decoding, such as searching,
>>>> would be less efficient if decoding was done.
>>>
>>> Searching that does not do decoding is fundamentally incorrect. That
>>> is, if you want to find a substring in a string, you cannot just
>>> compare chars.
>>
>> Assuming both strings are valid UTF-8, you can. Continuation bytes can
>> never
>> be confused with the first byte of a code point, and the first byte
>> always
>> identifies how many continuation bytes there should be.
>>
>
> As others have pointed out in the past to me (and I thought as you did
> once), the same characters can be encoded in *different ways*. They must
> be normalized to accurately compare.
>

Assuming language support stays at the stage of "codepoint is a character", it's totally expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks hitting certain issues.

> Plus, a combining character (such as an umlaut or accent) is part of a
> character, but may be a separate code point. If that's on the last
> character in the word such as fiancé, then searching for fiance will
> result in a match without proper decoding!

Now if you are going to do real characters... If the source and needle are normalized, you can still avoid lots of work by searching without decoding. All you need to decode is one code point on each successful match, to see if there is a modifier at the end of the matched portion.
But it depends on how you want to match; if it's a case-insensitive search it will be a lot more complicated, but anyway it boils down to this:
1) do an inexact search, get a likely match (false positives are OK, false negatives are not); no decoding here
2) once found, check it (or parts of it) with proper decoding

There are cultural subtleties that complicate these steps if you take them into account, but it's doable.
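Roughly like this in D (just a sketch: it assumes both inputs are already in the same normalization form, it leans on std.string.indexOf and std.utf.decode, and it assumes a classification predicate like std.uni.isMark is available; a real version would keep scanning instead of giving up on the first false positive):

import std.string : indexOf;   // code-unit level substring search
import std.uni : isMark;       // true for combining marks (category M*)
import std.utf : decode;

ptrdiff_t findNormalized(string haystack, string needle)
{
    auto pos = haystack.indexOf(needle);       // step 1: cheap inexact search
    if (pos < 0)
        return -1;

    size_t after = pos + needle.length;
    if (after < haystack.length)
    {
        dchar next = decode(haystack, after);  // step 2: decode one code point
        if (next.isMark)                       // a combining modifier follows,
            return -1;                         // so this was a false positive
    }
    return pos;
}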

> Or if fiancé uses a
> precomposed é, it won't match. So two valid representations of the word
> either match or they don't. It's just a complete mess without proper
> unicode decoding.

It's a complete mess even with proper decoding ;)


>
> -Steve


-- 
Dmitry Olshansky
October 24, 2011
On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> On 24.10.2011 23:41, Steven Schveighoffer wrote:
>> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
>> <simen.kjaras@gmail.com> wrote:
>>
>>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>>> <schveiguy@yahoo.com> wrote:
>>>
>>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>>> <newshound2@digitalmars.com> wrote:
>>>>
>>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>>> Which operations do you believe would be less efficient?
>>>>>
>>>>> All of the ones that don't require decoding, such as searching,
>>>>> would be less efficient if decoding was done.
>>>>
>>>> Searching that does not do decoding is fundamentally incorrect. That
>>>> is, if you want to find a substring in a string, you cannot just
>>>> compare chars.
>>>
>>> Assuming both strings are valid UTF-8, you can. Continuation bytes can
>>> never
>>> be confused with the first byte of a code point, and the first byte
>>> always
>>> identifies how many continuation bytes there should be.
>>>
>>
>> As others have pointed out in the past to me (and I thought as you did
>> once), the same characters can be encoded in *different ways*. They must
>> be normalized to accurately compare.
>>
>
> Assuming language support stays at the stage of "codepoint is a character", it's totally expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks hitting certain issues.

Again, the "risk" is that it fails to achieve the goal you ask of it!

D-language: Here, use this search algorithm, it works most of the time, but may not work correctly in some cases.  If you run into one of those cases, you'll have to run a specialized search algorithm for strings.
User: How do I know I hit one of those cases?
D-language: You'll have to run the specialized version to find out.
User: Why wouldn't I just run the specialized version first?
D-language: Well, because it's slower!
User: But don't I have to use both algorithms to make sure I find the data?
D-language: Only if you "care" about accuracy!

Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?

>
>> Plus, a combining character (such as an umlaut or accent) is part of a
>> character, but may be a separate code point. If that's on the last
>> character in the word such as fiancé, then searching for fiance will
>> result in a match without proper decoding!
>
> Now if you are going to do real characters... If the source and needle are normalized, you can still avoid lots of work by searching without decoding. All you need to decode is one code point on each successful match, to see if there is a modifier at the end of the matched portion.
> But it depends on how you want to match; if it's a case-insensitive search it will be a lot more complicated, but anyway it boils down to this:
> 1) do an inexact search, get a likely match (false positives are OK, false negatives are not); no decoding here
> 2) once found, check it (or parts of it) with proper decoding
>
> There are cultural subtleties that complicate these steps if you take them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar) comparison do not work, and can be optimized based on several factors.  The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding.  Then when found, decode the characters around it.  Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream.
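A much-simplified sketch of that second idea (generate the alternate forms of the needle up front and search for any of them, instead of normalizing a potentially huge haystack; normalize/NFC/NFD are today's std.uni names, and a mixed-normalization haystack would still slip through this):

import std.algorithm : canFind;
import std.uni : normalize, NFC, NFD;

bool containsAnyForm(string haystack, string needle)
{
    // The composed and decomposed forms cover precomposed vs.
    // combining-character spellings of the needle's graphemes.
    string composed   = normalize!NFC(needle);
    string decomposed = normalize!NFD(needle);
    return haystack.canFind(composed) || haystack.canFind(decomposed);
}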

This can all be built into a specialized string type.  There are actually some really interesting problems to solve in this space, I think.  I've created a basic string type that has languished in my unfinished pile of stuff to do.  I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing, etc.), but is *correct* before it is efficient.  You should always be able to drop into "array mode" and deal with the code-units.
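Something along these lines, purely as a sketch (the type and member names are invented; normalize and byGrapheme are from today's std.uni):

import std.uni : normalize, NFC, byGrapheme;
import std.range : walkLength;

struct UniString
{
    private string data;

    this(string s) { data = normalize!NFC(s); }  // store one canonical form

    // equality compares canonical forms, so "fiance\u0301" == "fianc\u00E9"
    bool opEquals(const UniString rhs) const { return data == rhs.data; }

    // length in graphemes, not code units
    @property size_t length() const { return data.byGrapheme.walkLength; }

    // the escape hatch: drop into "array mode" and deal with code units
    @property const(char)[] codeUnits() const { return data; }
}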

>> Or if fiancé uses a
>> precomposed é, it won't match. So two valid representations of the word
>> either match or they don't. It's just a complete mess without proper
>> unicode decoding.
>
> It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to!

-Steve
October 24, 2011
On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:
> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2@digitalmars.com>
> wrote:
>
>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>> Which operations do you believe would be less efficient?
>>
>> All of the ones that don't require decoding, such as searching, would be less
>> efficient if decoding was done.
>
> Searching that does not do decoding is fundamentally incorrect. That is, if you
> want to find a substring in a string, you cannot just compare chars.

Sure you can. A Unicode character is a string, a Unicode string is a string of those strings. So, searching for a Unicode character is searching for a substring.

Doing this character by character is standard, rote stuff. It's how grep works, for example.
October 24, 2011
On 10/24/2011 1:52 PM, Steven Schveighoffer wrote:
> Call me ludicrous, but is this really what we want to push on someone as a
> "unicode-aware" language?

There are different levels of Unicode support. D operates at the most basic level (forgot what it is called) where only recognition of the encodings is necessary.
October 24, 2011
On Mon, 24 Oct 2011 17:27:46 -0400, Walter Bright <newshound2@digitalmars.com> wrote:

> On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:
>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2@digitalmars.com>
>> wrote:
>>
>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>> Which operations do you believe would be less efficient?
>>>
>>> All of the ones that don't require decoding, such as searching, would be less
>>> efficient if decoding was done.
>>
>> Searching that does not do decoding is fundamentally incorrect. That is, if you
>> want to find a substring in a string, you cannot just compare chars.
>
> Sure you can. A Unicode character is a string, a Unicode string is a string of those strings. So, searching for a Unicode character is searching for a substring.

What if the source character is encoded differently than the search string?  This is basic unicode stuff.  See my example with fiancé.

-Steve
October 24, 2011
On 10/24/2011 10:52 PM, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
> <dmitry.olsh@gmail.com> wrote:
>
>> On 24.10.2011 23:41, Steven Schveighoffer wrote:
>>> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
>>> <simen.kjaras@gmail.com> wrote:
>>>
>>>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>>>> <schveiguy@yahoo.com> wrote:
>>>>
>>>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>>>> <newshound2@digitalmars.com> wrote:
>>>>>
>>>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>>>> Which operations do you believe would be less efficient?
>>>>>>
>>>>>> All of the ones that don't require decoding, such as searching,
>>>>>> would be less efficient if decoding was done.
>>>>>
>>>>> Searching that does not do decoding is fundamentally incorrect. That
>>>>> is, if you want to find a substring in a string, you cannot just
>>>>> compare chars.
>>>>
>>>> Assuming both strings are valid UTF-8, you can. Continuation bytes can
>>>> never
>>>> be confused with the first byte of a code point, and the first byte
>>>> always
>>>> identifies how many continuation bytes there should be.
>>>>
>>>
>>> As others have pointed out in the past to me (and I thought as you did
>>> once), the same characters can be encoded in *different ways*. They must
>>> be normalized to accurately compare.
>>>
>>
>> Assuming language support stays at the stage of "codepoint is a character",
>> it's totally expected to ignore modifiers and compare identically
>> normalized UTF without decoding. Yes, it risks hitting certain issues.
>
> Again, the "risk" is that it fails to achieve the goal you ask of it!
>
> D-language: Here, use this search algorithm, it works most of the time,
> but may not work correctly in some cases. If you run into one of those
> cases, you'll have to run a specialized search algorithm for strings.
> User: How do I know I hit one of those cases?
> D-language: You'll have to run the specialized version to find out.
> User: Why wouldn't I just run the specialized version first?
> D-language: Well, because it's slower!
> User: But don't I have to use both algorithms to make sure I find the data?
> D-language: Only if you "care" about accuracy!
>
> Call me ludicrous, but is this really what we want to push on someone as
> a "unicode-aware" language?
>
>>
>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>> character, but may be a separate code point. If that's on the last
>>> character in the word such as fiancé, then searching for fiance will
>>> result in a match without proper decoding!
>>
>> Now if you are going to do real characters... If the source and needle
>> are normalized, you can still avoid lots of work by searching without
>> decoding. All you need to decode is one code point on each successful
>> match, to see if there is a modifier at the end of the matched portion.
>> But it depends on how you want to match; if it's a case-insensitive
>> search it will be a lot more complicated, but anyway it boils down to
>> this:
>> 1) do an inexact search, get a likely match (false positives are OK,
>> false negatives are not); no decoding here
>> 2) once found, check it (or parts of it) with proper decoding
>>
>> There are cultural subtleties that complicate these steps if you take
>> them into account, but it's doable.
>
> I agree with you that simple searches using only byte (or dchar)
> comparison do not work, and can be optimized based on several factors.
> The easiest thing is to find a code unit sequence that only has one
> valid form, then search for that without decoding. Then when found,
> decode the characters around it. Or if that isn't possible, create all
> the un-normalized forms for one grapheme (based on how likely it is to
> occur), and search for one of those in the undecoded stream.
>
> This can all be built into a specialized string type. There are actually
> some really interesting problems to solve in this space, I think. I've
> created a basic string type that has languished in my unfinished pile of
> stuff to do. I think it can be done in a way that is close to as
> efficient as arrays for the most common operations (slicing, indexing,
> etc.), but is *correct* before it is efficient. You should always be
> able to drop into "array mode" and deal with the code-units.
>
>>> Or if fiancé uses a
>>> precomposed é, it won't match. So two valid representations of the word
>>> either match or they don't. It's just a complete mess without proper
>>> unicode decoding.
>>
>> It's a complete mess even with proper decoding ;)
>
> Yes, all the more reason to solve the problem correctly so the hapless
> unicode novice user doesn't have to!
>
> -Steve

The answer is probably to get proper unicode normalization into Phobos. The efficient versions can then specify that they are guaranteed to be perfectly accurate on normalized input only.
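For reference, a sketch of how that would look, assuming a normalize routine along the lines of what std.uni provides in current Phobos (normalize both sides once up front, and the cheap comparison afterwards is accurate):

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    string haystack = "fiance\u0301";  // 'e' + combining acute
    string needle   = "fianc\u00E9";   // precomposed é

    // Without normalization the two spellings never compare equal...
    assert(!haystack.canFind(needle));

    // ...but normalized to the same form, a plain search is accurate.
    assert(normalize!NFC(haystack).canFind(normalize!NFC(needle)));
}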