October 24, 2011
On 25.10.2011 0:52, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
> <dmitry.olsh@gmail.com> wrote:
>
>> On 24.10.2011 23:41, Steven Schveighoffer wrote:
>>> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
>>> <simen.kjaras@gmail.com> wrote:
>>>
>>>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>>>> <schveiguy@yahoo.com> wrote:
>>>>
>>>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>>>> <newshound2@digitalmars.com> wrote:
>>>>>
>>>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>>>> Which operations do you believe would be less efficient?
>>>>>>
>>>>>> All of the ones that don't require decoding, such as searching,
>>>>>> would be less efficient if decoding was done.
>>>>>
>>>>> Searching that does not do decoding is fundamentally incorrect. That
>>>>> is, if you want to find a substring in a string, you cannot just
>>>>> compare chars.
>>>>
>>>> Assuming both string are valid UTF-8, you can. Continuation bytes can
>>>> never
>>>> be confused with the first byte of a code point, and the first byte
>>>> always
>>>> identifies how many continuation bytes there should be.
>>>>
>>>
>>> As others have pointed out in the past to me (and I thought as you did
>>> once), the same characters can be encoded in *different ways*. They must
>>> be normalized to accurately compare.
>>>
>>
>> Assuming language support stays on stage of "codepoint is a character"
>> it's totaly expected to ignore modifiers and compare identically
>> normalized UTF without decoding. Yes, it risks to hit certain issues.
>
> Again, the "risk" is that it fails to achieve the goal you ask of it!
>

Last time I checked, the legion of "everything is ASCII" zealots was pretty large ;) So the "goal" is a blurry line, meaning that "search for a substring in this chunk of unicode text" can mean very different things, and be correct in its own sense on about three different levels.
There is no single default way, so I'm not sure we have to embed all of this complexity into the language right now. Phobos is where we should implement this first; once it works we can look at revisiting our pick of default comparison method, etc.
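To make those levels concrete, a toy illustration (code-point level search via std.algorithm; the same word "fiancé" spelled in two valid ways):

import std.algorithm : canFind;
import std.stdio : writeln;

void main()
{
    string nfc = "fianc\u00E9";      // 'é' as one precomposed code point
    string nfd = "fiance\u0301";     // 'e' followed by COMBINING ACUTE ACCENT

    writeln(nfc == nfd);             // false: not equal at the code-unit level
    writeln(nfd.canFind("fiance"));  // true:  a code-point search "finds" fiance,
                                     //        slicing the é grapheme in half
    writeln(nfc.canFind("fiance"));  // false: same word, other spelling, no match
}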

> D-language: Here, use this search algorithm, it works most of the time,
> but may not work correctly in some cases. If you run into one of those
> cases, you'll have to run a specialized search algorithm for strings.
> User: How do I know I hit one of those cases?
> D-language: You'll have to run the specialized version to find out.
> User: Why wouldn't I just run the specialized version first?
> D-language: Well, because it's slower!
> User: But don't I have to use both algorithms to make sure I find the data?
> D-language: Only if you "care" about accuracy!
>
> Call me ludicrous, but is this really what we want to push on someone as
> a "unicode-aware" language?
>

No, but we might want to fix string search to do a little more - namely, check whether the match split a grapheme (assuming normalizations match). BTW, adding normalization to std.uni is another thing to do right now. That should be a good enough start, and looking at the Unicode standard, things are rather fluid there, meaning that a "unicode-aware" language should be suffixed with a version number :)

>>
>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>> character, but may be a separate code point. If that's on the last
>>> character in the word such as fiancé, then searching for fiance will
>>> result in a match without proper decoding!
>>
>> Now if you are going to do real characters... If source/needle are
>> normalized you still can avoid lots of work by searching without
>> decoding. All you need to decode is one codepoint on each successful
>> match to see if there is a modifier at end of matched portion.
>> But it depends on how you want to match if it's case-insensitive
>> search it will be a lot more complicated, but anyway it boils down to
>> this:
>> 1) do inexact search, get likely match ( false positives are OK,
>> negatives not) no decoding here
>> 2) once found check it (or parts of it) with proper decoding
>>
>> There are cultural subtleties, that complicate these steps if you take
>> them into account, but it's doable.
>
> I agree with you that simple searches using only byte (or dchar)
> comparison does not work, and can be optimized based on several factors.
> The easiest thing is to find a code unit sequence that only has one
> valid form, then search for that without decoding. Then when found,
> decode the characters around it. Or if that isn't possible, create all
> the un-normalized forms for one grapheme (based on how likely it is to
> occur), and search for one of those in the undecoded stream.
>

Yes, it all revolves around finding all variations of a substring.
If the user is willing to spend some time upfront, this could be done in one pass, e.g. using the trick I employed in the new std.regex.
For a rough idea, see the Shift-Or string search, which is an easy & fast inexact search:
http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

There are even better variations of this stuff in the wild.
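To show the flavor of it, a minimal sketch of plain Shift-Or over raw bytes (exact match only, pattern limited to one machine word; this is just the textbook version, not the std.regex code):

// Returns the byte index of the first exact match, or size_t.max if none.
size_t shiftOrFind(const(ubyte)[] haystack, const(ubyte)[] needle)
{
    immutable m = needle.length;
    if (m == 0 || m > 64) return size_t.max;  // sketch: one 64-bit word only

    ulong[256] mask;
    mask[] = ulong.max;                       // all bits set = "no position matches"
    foreach (i, b; needle)
        mask[b] &= ~(1UL << i);               // clear bit i where byte b may occur

    ulong state = ulong.max;
    immutable matchBit = 1UL << (m - 1);
    foreach (j, b; haystack)
    {
        state = (state << 1) | mask[b];       // shift in a 0 only on a byte match
        if ((state & matchBit) == 0)
            return j - m + 1;                 // match ends at byte j
    }
    return size_t.max;
}

// e.g. shiftOrFind(cast(const(ubyte)[]) text, cast(const(ubyte)[]) "fiance")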

> This can all be built into a specialized string type. There's actually
> some really interesting problems to solve in this space I think. I've
> created a basic string type that has lamented in my unfinished pile of
> stuff to do. I think it can be done in a way that is close to as
> efficient as arrays for the most common operations (slicing, indexing,
> etc.), but is *correct* before it is efficient. You should always be
> able to drop into "array mode" and deal with the code-units.
>

Here is where we probably differ: to me there is no simple, safe, fast string type. Why? Because whenever you need anything that is not "pedantic and safe above all", you end up working with an implicit data type such as "a chunk of UTF-8 bytes normalized in NFC", ...
I'd rather see them as sealed arrays with a bunch of meta info, and string functions that specialize on them in order to do some cool speed hacks. Of course, drilling down to the raw data should be possible for the user as well.
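Roughly what I have in mind (all names made up on the spot, nothing more than a sketch):

enum Norm { unknown, nfc, nfd, nfkc, nfkd }

struct UniStr
{
    private immutable(ubyte)[] data;  // sealed: no random indexing by the user
    Norm norm = Norm.unknown;         // what we know about its normalization

    // String functions specialize on `norm` to pick fast paths;
    // drilling down to the raw data stays possible:
    @property const(ubyte)[] raw() const { return data; }
}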

>> Or if fiancé uses a
>>> precomposed é, it won't match. So two valid representations of the word
>>> either match or they don't. It's just a complete mess without proper
>>> unicode decoding.
>>
>> It's a complete mess even with proper decoding ;)
>
> Yes, all the more reason to solve the problem correctly so the hapless
> unicode novice user doesn't have to!
>


A hapless unicode novice doesn't stand a chance.
I'm serious. One needs to know a lot of these "basics" to e.g. understand why indexing by "true" character could be so slow, and to know about encodings, normalizations, etc. Yet we can save a metric ton of special-cased crap from ever hitting his eyes.


-- 
Dmitry Olshansky
October 24, 2011
On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> Plus, a combining character (such as an umlaut or accent) is part of a
> character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not enough.
You seem to advocate decoding to graphemes, which is a whole different matter.


-- 
  Simen
October 25, 2011
On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> What if the source character is encoded differently than the search
> string?  This is basic unicode stuff.  See my example with fiancé.

The more I think about it, the more I think it should work like this: just like we assume they contain well-formed UTF sequences, char[], wchar[], and dchar[] should also be assumed to contain **normalized** unicode strings. Which normalization form to use? No idea. Just pick a sensible one of the four.
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>

Once we know all strings are normalized in the same way, we can then compare two strings bitwise to check if they're the same. And we can check for a substring in the same manner except we need to insert a simple check after the match to verify that it isn't surrounded by combining marks applying to the first or last character. And it'll also deeply simplify any proper Unicode string handling code we'll add in the future.

- - -

That said, I fear that forcing a specific normalization might be problematic. You don't always want to have to normalize everything...
<http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>

So perhaps we could simplify things a bit more: don't pick a standard normalization form. Just assume that both strings being used are in the same normalization form. Comparison will work, searching for substring in the way specified above will work, and other functions could document which normalization form they accept. Problem solved... somewhat.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

October 26, 2011
On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" <schveiguy@yahoo.com> said:
>
>> What if the source character is encoded differently than the search
>> string?  This is basic unicode stuff.  See my example with fiancé.
>
> The more I think about it, the more I think it should work like this: just like we assume they contain well-formed UTF sequences, char[], wchar[], and dchar[] should also be assumed to contain **normalized** unicode strings. Which normalization form to use? no idea. Just pick a sensible one in the four.
> <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
>
> Once we know all strings are normalized in the same way, we can then compare two strings bitwise to check if they're the same. And we can check for a substring in the same manner except we need to insert a simple check after the match to verify that it isn't surrounded by combining marks applying to the first or last character. And it'll also deeply simplify any proper Unicode string handling code we'll add in the future.
>
>  - - -
>
> That said, I fear that forcing a specific normalization might be problematic. You don't always want to have to normalize everything...
> <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>
>
> So perhaps we could simplify things a bit more: don't pick a standard normalization form. Just assume that both strings being used are in the same normalization form. Comparison will work, searching for substring in the way specified above will work, and other functions could document which normalization form they accept. Problem solved... somewhat.

It's even easier than this:

a) If you want to do a proper string comparison without knowing what state the unicode strings are in, use the full-fledged decode-when-needed string type and its associated str.find method.
b) If you know they are both in the same normalized form and want to optimize, use std.algorithm.find(haystack.asArray, needle.asArray).

-Steve
October 26, 2011
On Mon, 24 Oct 2011 18:25:25 -0400, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> On 25.10.2011 0:52, Steven Schveighoffer wrote:
>> On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
>> <dmitry.olsh@gmail.com> wrote:
>>
>>> On 24.10.2011 23:41, Steven Schveighoffer wrote:
>>>> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
>>>> <simen.kjaras@gmail.com> wrote:
>>>>
>>>>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>>>>> <schveiguy@yahoo.com> wrote:
>>>>>
>>>>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>>>>> <newshound2@digitalmars.com> wrote:
>>>>>>
>>>>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>>>>> Which operations do you believe would be less efficient?
>>>>>>>
>>>>>>> All of the ones that don't require decoding, such as searching,
>>>>>>> would be less efficient if decoding was done.
>>>>>>
>>>>>> Searching that does not do decoding is fundamentally incorrect. That
>>>>>> is, if you want to find a substring in a string, you cannot just
>>>>>> compare chars.
>>>>>
>>>>> Assuming both string are valid UTF-8, you can. Continuation bytes can
>>>>> never
>>>>> be confused with the first byte of a code point, and the first byte
>>>>> always
>>>>> identifies how many continuation bytes there should be.
>>>>>
>>>>
>>>> As others have pointed out in the past to me (and I thought as you did
>>>> once), the same characters can be encoded in *different ways*. They must
>>>> be normalized to accurately compare.
>>>>
>>>
>>> Assuming language support stays on stage of "codepoint is a character"
>>> it's totaly expected to ignore modifiers and compare identically
>>> normalized UTF without decoding. Yes, it risks to hit certain issues.
>>
>> Again, the "risk" is that it fails to achieve the goal you ask of it!
>>
>
> Last time I checked, the legion of "everything is ASCII" zealots was pretty large ;) So the "goal" is a blury line, meaning that "search for a substring on in this chunk of unicode text" can mean very different things, and be correct in it's own sense on about 3 different levels.
> There is no default way, so I'm not sure we have to embed all of this complexity into language right now. Phobos is were we must first implement this, once it works we may look at revisiting our pick of default comparison method, etc.

If we say D supports only the subset of unicode that works on ASCII characters, well, then I think we do not support unicode at all ;)

I agree the string type needs to be fleshed out before the language adopts it.  I think we could legitimately define a string type that auto-assigns from its respective char[] array type.  Once it's shown that the type is nearly as fast as a char[] array, it can be used as the default type for string literals (the only real place where the language gets in the way).
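Something along these lines should already be expressible in the library (names hypothetical and purely illustrative; only making string literals default to it needs the compiler):

struct String
{
    private string payload;                  // code units kept verbatim

    this(string s) { payload = s; }          // accepts string literals and slices
    void opAssign(string s) { payload = s; }

    // "array mode": hand back the raw code units on request
    @property string codeUnits() const { return payload; }
}

// usage:
// String s = "fiancé";    // construction from a literal already works
// s = "fiance\u0301";     // so does re-assignment from string data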

>> D-language: Here, use this search algorithm, it works most of the time,
>> but may not work correctly in some cases. If you run into one of those
>> cases, you'll have to run a specialized search algorithm for strings.
>> User: How do I know I hit one of those cases?
>> D-language: You'll have to run the specialized version to find out.
>> User: Why wouldn't I just run the specialized version first?
>> D-language: Well, because it's slower!
>> User: But don't I have to use both algorithms to make sure I find the data?
>> D-language: Only if you "care" about accuracy!
>>
>> Call me ludicrous, but is this really what we want to push on someone as
>> a "unicode-aware" language?
>>
>
> No, but we might want to fix a string search to do a little more - namely check if it skewed a graphem (assuming normalizations match).

That's a big assumption.  Valid unicode is valid unicode, even if it's not normalized.

> BTW adding normalization to std.uni is another thing to do right now. That should be a good enough start, and looking at unicode standard, things are rather fluid there meaning that "unicode-aware" language should be sufixed by version number :)

I agree we need normalization, and it is not necessary to involve the compiler in this.  I'd suggest a two- to three-phase approach:

1. Leave phobos' schizo view of "char arrays are not arrays" for now, and build the necessary framework to get a string type that actually works.
2. Remove the schizo view.
3. (optional) Make the compiler use the library-defined string type for string literals.

>
>>>
>>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>>> character, but may be a separate code point. If that's on the last
>>>> character in the word such as fiancé, then searching for fiance will
>>>> result in a match without proper decoding!
>>>
>>> Now if you are going to do real characters... If source/needle are
>>> normalized you still can avoid lots of work by searching without
>>> decoding. All you need to decode is one codepoint on each successful
>>> match to see if there is a modifier at end of matched portion.
>>> But it depends on how you want to match if it's case-insensitive
>>> search it will be a lot more complicated, but anyway it boils down to
>>> this:
>>> 1) do inexact search, get likely match ( false positives are OK,
>>> negatives not) no decoding here
>>> 2) once found check it (or parts of it) with proper decoding
>>>
>>> There are cultural subtleties, that complicate these steps if you take
>>> them into account, but it's doable.
>>
>> I agree with you that simple searches using only byte (or dchar)
>> comparison does not work, and can be optimized based on several factors.
>> The easiest thing is to find a code unit sequence that only has one
>> valid form, then search for that without decoding. Then when found,
>> decode the characters around it. Or if that isn't possible, create all
>> the un-normalized forms for one grapheme (based on how likely it is to
>> occur), and search for one of those in the undecoded stream.
>>
>
> Yes, it's all revolves around find all of variations of a substring.
> If user is willing to spend some time upfront, this could be done in one pass e.g. using the trick I employed in new std.regex.
> For a rough idea see ShiftOr string search,
> that is easy & fast inexact search:
> http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060
>
> There are even better variations of this stuff in the wild.

This sounds good.  I'll have a look at it when I have time.

>
>> This can all be built into a specialized string type. There's actually
>> some really interesting problems to solve in this space I think. I've
>> created a basic string type that has lamented in my unfinished pile of
>> stuff to do. I think it can be done in a way that is close to as
>> efficient as arrays for the most common operations (slicing, indexing,
>> etc.), but is *correct* before it is efficient. You should always be
>> able to drop into "array mode" and deal with the code-units.
>>
>
> Here is where we probably differ, to me there is no simple safe fast string type. Why? Because if you need anything "not pedantic and safe above all" you work with implicit data type of e.g. "a chunk of UTF-8 bytes that are normalized in NFC", ...

Specialized types which dictate normalization could be created.  We do not have to always assume the worst if we are optimizing.

But in general, when you read a utf-8 file, it could have anything in it.  It may not even be normalized.

> I'd rather see them as sealed array with a bunch of meta info, and string functions to specialize on them in order to do some cool speed hacks. Of course, drilling down to row data should be possible for the user as well.

This is exactly what I am envisioning ;)  Except I'd build the meta-info into the type.

>>> Or if fiancé uses a
>>>> precomposed é, it won't match. So two valid representations of the word
>>>> either match or they don't. It's just a complete mess without proper
>>>> unicode decoding.
>>>
>>> It's a complete mess even with proper decoding ;)
>>
>> Yes, all the more reason to solve the problem correctly so the hapless
>> unicode novice user doesn't have to!
>>
>
>
> Hapless novice unicode user don't stand a chance.
> I'm serious. One needs to know a lot of these "basics" to e.g. know why indexing by "true" character could be so slow, about encodings, normalizations etc. Yet we can save a metric ton of special cased crap from ever hitting his eyes.

I disagree.  99% of the time I use strings, I don't care about indexes.  I just want to deal with whole strings and substrings.  I rarely use arbitrary indexes, and when I do, I'm assuming ascii data.

The hapless novice unicode user doesn't need to know unicode to do:

writefln("hello, world!");

Or to do:

if(find(str, "xyz")) {...}

Or to even use regex!

-Steve
October 26, 2011
On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas <simen.kjaras@gmail.com> wrote:

> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>
>> Plus, a combining character (such as an umlaut or accent) is part of a
>> character, but may be a separate code point.
>
> If this is correct (and it is), then decoding to dchar is simply not enough.
> You seem to advocate decoding to graphemes, which is a whole different matter.

I am advocating that.  And it's a matter of perception.  D can say "we only support code-point decoding" and what that means to a user is, "we don't support language as you know it."  Sure it's a part of unicode, but it takes that extra piece to make it actually usable to people who require unicode.

Even in English, fiancé has an accent.  To say D supports unicode, and then fail a simple search on a file which contains a certain *valid* encoding of that word, is disingenuous to say the least.

D needs a fully unicode-aware string type.  I advocate D should use it as the default string type, but it needs one whether it's the default or not in order to say it supports unicode.

-Steve
October 26, 2011
On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> It's even easier than this:
> 
> a) you want to do a proper string comparison not knowing what state the
> unicode strings are in, use the full-fledged decode-when-needed string
> type, and its associated str.find method.
> b) you know they are both the same normalized form and want to optimize,
> use std.algorithm.find(haystack.asArray, needle.asArray).

Well, treating the string as an array of dchar doesn't work in the general case: even with strings normalized the same way, your fiancé example can break. So I should never treat them as plain arrays unless I'm sure there are no combining marks in the string.

I'm not opposed to having a new string type being developed, but I'm skeptical about its inclusion in the language. We already have three string types which can be assumed to contain valid UTF sequences. I think the first thing to do is not to develop a new string type, but to develop the normalization and grapheme-splitting algorithms, as well as those to find a substring, using the existing char[], wchar[] and dchar[] types.  Then write a program with proper handling of Unicode using those and hand-optimize it. If that proves to be a pain (it might well be), write a new string type, rewrite the program using it, do some benchmarks, and then we'll know if it's a good idea and will be able to quantify the drawbacks.
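As a taste of what I mean by the grapheme-splitting piece over plain char[], a minimal sketch (isCombining stands in for a real combining-mark test that std.uni would have to provide; full grapheme segmentation has more rules than this):

import std.utf : decode;

// Length in code units of the cluster (base + trailing combining marks)
// starting at `index` in `str`.
size_t graphemeStride(alias isCombining)(const(char)[] str, size_t index)
{
    size_t i = index;
    decode(str, i);                       // consume the base code point
    while (i < str.length)
    {
        size_t j = i;
        immutable c = decode(str, j);     // peek at the following code point
        if (!isCombining(c)) break;
        i = j;                            // a mark: fold it into the cluster
    }
    return i - index;
}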

But right now, all this arguing for or against a new string type is stacking hypotheses against other hypotheses; it won't lead anywhere.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

October 26, 2011
On Wed, 26 Oct 2011 10:04:08 -0400, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" <schveiguy@yahoo.com> said:
>
>> It's even easier than this:
>>  a) you want to do a proper string comparison not knowing what state the
>> unicode strings are in, use the full-fledged decode-when-needed string
>> type, and its associated str.find method.
>> b) you know they are both the same normalized form and want to optimize,
>> use std.algorithm.find(haystack.asArray, needle.asArray).
>
> Well, treating the string as an array of dchar doesn't work in the general case, even with strings normalized the same way your fiancé example can break. So should never treat them as plain arrays unless I'm sure I have no combining marks in the string.

If you have the same normalization, you would at least find all the valid instances.  You'd have to eliminate false positives by decoding the next dchar after the match to see if it's a combining character.  This would be a simple function to write.
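For what it's worth, that simple function might look something like this (isCombining is a placeholder for whatever combining-mark test ends up in std.uni; this is a sketch, not existing API):

import std.utf : decode;

// True if a match of `len` code units found at `pos` does not cut into the
// following grapheme, i.e. the next code point (if any) is not a combining
// mark attaching to the last matched character.
bool matchEndsGrapheme(alias isCombining)(const(char)[] haystack, size_t pos, size_t len)
{
    size_t i = pos + len;
    if (i >= haystack.length) return true;   // the match runs to the end
    return !isCombining(decode(haystack, i));
}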

All I'm saying is, doing a search on the array should be available.

>
> I'm not opposed to having a new string type being developed, but I'm skeptical about its inclusion in the language. We already have three string types which can be assumed to contain valid UTF sequences. I think the first thing to do is not to develop a new string type, but to develop the normalization and grapheme splitting algorithms, and those to find a substring, using the existing char[], wchar[] and dchar[] types.  Then write a program with proper handling of Unicode using those and hand-optimize it. If that proves to be a pain (it might well be), write a new string type, rewrite the program using it, do some benchmarks and then we'll know if it's a good idea, and will be able to quantify the drawbacks.
>
> But right now, all this arguing for or against a new string type is stacking hypothesis against other hypothesis, it won't lead anywhere.
>

I have a half-implemented string type which does just this.  I need to finish it.  There are some language barriers that need to be sorted out (such as how to deal with tail-const for arbitrary types).

I think a user who no longer posts here (spir) had some full implementation of unicode strings using classes, though I don't see that being comparable in performance.

-Steve
October 26, 2011
On 2011-10-26 14:20:54 +0000, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> If you have the same normalization, you would at least find all the valid
> instances.  You'd have to eliminate false positives, by decoding the next
> dchar after the match to see if it's a combining character.  This would be
> a simple function to write.
> 
> All I'm saying is, doing a search on the array should be available.

And all I'm saying is that such a standalone function that eliminates the false positive should be available outside of a string type too. You'll likely need it for your string type anyway, but there's no reason you shouldn't be able to use it standalone on a char[]/wchar[]/dchar[].

I think most string algorithms should be able to work with normalized character arrays; a new string type should just be a shell around these that makes things easier for the user.


> I have a half-implemented string type which does just this.  I need to
> finish it.

That should be interesting.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

October 26, 2011
On 26.10.2011 16:12, Steven Schveighoffer wrote:
<snip>
>
> I agree the string type needs to be fleshed out before the language
> adopts it. I think we could legitimately define a string type that
> auto-assigns from it's respective char[] array type. Once it's shown
> that the type is nearly as fast as a char[] array, it can be used as the
> default type for string literals (the only real place where the language
> gets in the way).
>

That's it.

>>> D-language: Here, use this search algorithm, it works most of the time,
>>> but may not work correctly in some cases. If you run into one of those
>>> cases, you'll have to run a specialized search algorithm for strings.
>>> User: How do I know I hit one of those cases?
>>> D-language: You'll have to run the specialized version to find out.
>>> User: Why wouldn't I just run the specialized version first?
>>> D-language: Well, because it's slower!
>>> User: But don't I have to use both algorithms to make sure I find the
>>> data?
>>> D-language: Only if you "care" about accuracy!
>>>
>>> Call me ludicrous, but is this really what we want to push on someone as
>>> a "unicode-aware" language?
>>>
>>
>> No, but we might want to fix a string search to do a little more -
>> namely check if it skewed a graphem (assuming normalizations match).
>
> That's a big assumption. Valid unicode is valid unicode, even if it's
> not normalized.
>
>> BTW adding normalization to std.uni is another thing to do right now.
>> That should be a good enough start, and looking at unicode standard,
>> things are rather fluid there meaning that "unicode-aware" language
>> should be sufixed by version number :)
>
> I agree we need normalization and it is not necessary to involve the
> compiler in this. I'd suggest a two to three phase approach:
>
> 1. Leave phobos' schizo view of "char arrays are not arrays" for now,
> and build the necessary framework to get a string type that actually works.
> 2. Remove schizo view.
> 3. (optional) make compiler use library-defined string type for string
> literals.
>

Something like that. I may lend a hand here, as I plan to merge some unicode stuff between std.regex and std.uni. As extras, probably striding by grapheme and support for various degrees of case-insensitive string comparison.

>>
>>>>
>>>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>>>> character, but may be a separate code point. If that's on the last
>>>>> character in the word such as fiancé, then searching for fiance will
>>>>> result in a match without proper decoding!
>>>>
>>>> Now if you are going to do real characters... If source/needle are
>>>> normalized you still can avoid lots of work by searching without
>>>> decoding. All you need to decode is one codepoint on each successful
>>>> match to see if there is a modifier at end of matched portion.
>>>> But it depends on how you want to match if it's case-insensitive
>>>> search it will be a lot more complicated, but anyway it boils down to
>>>> this:
>>>> 1) do inexact search, get likely match ( false positives are OK,
>>>> negatives not) no decoding here
>>>> 2) once found check it (or parts of it) with proper decoding
>>>>
>>>> There are cultural subtleties, that complicate these steps if you take
>>>> them into account, but it's doable.
>>>
>>> I agree with you that simple searches using only byte (or dchar)
>>> comparison does not work, and can be optimized based on several factors.
>>> The easiest thing is to find a code unit sequence that only has one
>>> valid form, then search for that without decoding. Then when found,
>>> decode the characters around it. Or if that isn't possible, create all
>>> the un-normalized forms for one grapheme (based on how likely it is to
>>> occur), and search for one of those in the undecoded stream.
>>>
>>
>> Yes, it's all revolves around find all of variations of a substring.
>> If user is willing to spend some time upfront, this could be done in
>> one pass e.g. using the trick I employed in new std.regex.
>> For a rough idea see ShiftOr string search,
>> that is easy & fast inexact search:
>> http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060
>>
>> There are even better variations of this stuff in the wild.
>
> This sounds good. I'll have a look at it when I have time.
>
>>
>>> This can all be built into a specialized string type. There's actually
>>> some really interesting problems to solve in this space I think. I've
>>> created a basic string type that has lamented in my unfinished pile of
>>> stuff to do. I think it can be done in a way that is close to as
>>> efficient as arrays for the most common operations (slicing, indexing,
>>> etc.), but is *correct* before it is efficient. You should always be
>>> able to drop into "array mode" and deal with the code-units.
>>>
>>
>> Here is where we probably differ, to me there is no simple safe fast
>> string type. Why? Because if you need anything "not pedantic and safe
>> above all" you work with implicit data type of e.g. "a chunk of UTF-8
>> bytes that are normalized in NFC", ...
>
> Specialized types which dictate normalization could be created. We do
> not have to always assume the worst if we are optimizing.
>
> But in general, when you read a utf-8 file, it could have anything in
> it. It may not even be normalized.
>

There should be a way to do e.g.:
auto my_string = normalize(readText("file.txt"), Normalize.NFKC);
// my_string is typed as normalized UTF-8 or some such
Yes, that should keep the general, non-assuming case. Though it still can do some clever things, since as the W3C points out, >90% of text on the web is already normalized in NFC (I can search for a proof link if hard pressed).

>> I'd rather see them as sealed array with a bunch of meta info, and
>> string functions to specialize on them in order to do some cool speed
>> hacks. Of course, drilling down to row data should be possible for the
>> user as well.
>
> This is exactly what I am envisioning ;) Except I'd build the meta-info
> into the type.
>

Yeah, I meant that the type carries this info.
Looks like we are more in agreement than I suspected ;)

>>>> Or if fiancé uses a
>>>>> precomposed é, it won't match. So two valid representations of the
>>>>> word
>>>>> either match or they don't. It's just a complete mess without proper
>>>>> unicode decoding.
>>>>
>>>> It's a complete mess even with proper decoding ;)
>>>
>>> Yes, all the more reason to solve the problem correctly so the hapless
>>> unicode novice user doesn't have to!
>>>
>>
>>
>> Hapless novice unicode user don't stand a chance.
>> I'm serious. One needs to know a lot of these "basics" to e.g. know
>> why indexing by "true" character could be so slow, about encodings,
>> normalizations etc. Yet we can save a metric ton of special cased crap
>> from ever hitting his eyes.
>
> I disagree. 99% of the time I use strings, I don't care about indexes. I
> just want to deal with whole strings and substrings. I rarely use
> arbitrary indexes, and when I do, I'm assuming ascii data.
>
> The hapless novice unicode user doesn't need to know unicode to do:
>
> writefln("hello, world!");
>
> Or to do:
>
> if(find(str, "xyz")) {...}
>
> Or to even use regex!
>

To print chars you don't need to know unicode, your OS does :) Though that wasn't string processing. However, find/match is indeed a good example. You are probably right: if we cover enough of the common operations, most people might not have to "look inside".


-- 
Dmitry Olshansky