January 13, 2011
On 2011-01-13 06:48:46 -0500, spir <denis.spir@gmail.com> said:

> Note that D's stdlib currently provides no means to do this, not even on the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode library) (good luck ;-). But even ICU, as well as supposed unicode-aware types or libraries for any language, would not give you an abstraction producing correct results for Michel's example. For instance, Python3 code fails as miserably as any other. AFAIK, D is the first and only language having such a tool (Text.d at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

D is not the first language dealing correctly with Unicode strings in this manner. Objective-C's NSString class search and compare methods deal with characters with combining marks correctly. If you want to compare code points, you can do so explicitly using the NSLiteralSearch option, but the default is to compare the canonical version (at the grapheme level).
<http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>
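For readers who want to see the difference between a literal and a canonical comparison in D itself, here is a minimal sketch. It assumes a present-day Phobos, where std.uni provides normalize (this was not available when the thread was written), and canonicallyEqual is a hypothetical helper name, not an existing library function.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;

/// Hypothetical helper: compare two strings after bringing both to NFC.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";  // "é" as a single code point
    string decomposed  = "e\u0301"; // "e" followed by a combining acute accent

    writeln(precomposed == decomposed);                 // literal comparison: false
    writeln(canonicallyEqual(precomposed, decomposed)); // canonical comparison: true
}
```

The plain == operator plays the role of NSLiteralSearch here, while the normalizing helper approximates the default behaviour described above.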

In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 13, 2011
On 2011-01-13 07:10:09 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:

> However, regardless of what the best way to handle unicode is in general, I think that it's painfully clear that your average programmer doesn't know much about unicode. Even understanding the nuances between char, wchar, and dchar is more than your average programmer seems to understand at first. The idea that a char wouldn't be guaranteed to be an actual character is not something that many programmers take to immediately. It's quite foreign to how chars are typically dealt with in other languages, and many programmers never worry about unicode at all, only dealing with ASCII. So, not only is unicode a rather disgusting problem, but it's not one that your average programmer begins to grasp as far as I've seen. Unless the issue is abstracted away completely, it takes a fair bit of explaining to understand how to deal with unicode properly.

What's nice about Cocoa's way of handling strings is that even programmers not bothering about it get things right most of the time. Strings are compared in their canonical form (graphemes), unless you request a literal comparison; and they are sorted and compared case-insensitively according to the user's locale, unless you specify your own locale settings. Its only major pitfall is that indexing is done on UTF-16 code units.

The cost for this correctness is a small performance penalty, but I think it's the right path to take. For when performance or access to code points is important, the programmer should still be able to go down one layer and play with code points directly.

That said, we need to make sure the performance drop is minimal. I somewhat doubt that spir's approach of storing strings as an array of piles of characters is the right approach for most usage scenarios, but this area would need a little more research. spir's approach is certainly the ultimate step in correctness as it allows O(1) indexing of graphemes, but personally I'd favor not having indexing and just doing on-the-fly decoding at the grapheme level when performing various string operations.
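As an illustration of on-the-fly grapheme decoding without any indexing, here is a small sketch. It assumes a present-day Phobos, where std.uni offers byGrapheme (it did not exist at the time of this post), and graphemeCount is a hypothetical name chosen for the example.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Hypothetical helper: count user-perceived characters by decoding
/// grapheme clusters lazily, without building an array of them.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "e\u0301a"; // "é" built from two code points, followed by "a"

    writeln(s.length);         // 4 code units
    writeln(s.walkLength);     // 3 code points
    writeln(graphemeCount(s)); // 2 graphemes
}
```

Nothing is stored per grapheme; the cost is paid only while iterating, which is the trade-off argued for above.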

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 13, 2011
OT: Spir, do you know if I can change the syntax highlighting settings on bitbucket? I can't see anything with these gray on dark-gray colors: http://i.imgur.com/SmLk1.jpg
January 13, 2011
On Tue, 11 Jan 2011 18:00:30 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/11/11 11:21 AM, Steven Schveighoffer wrote:

>> It is supposed to be simple, and provide the expected interface, without
>> causing any undue performance degradation. That is, I should be able to
>> do all the things with a replacement string type that I can with a char
>> array today, as efficiently as I can today, except I should have to work
>> to get at the code-units. The huge benefit is that I can say "I'm
>> dealing with this as an array" when I know it's safe
>
> Unfinished sentence?

Sorry, I forgot '.' :)

> Anyway, for my money you just described what we have now.

All except the 'expected interface' part.  The string type should deal with dchars exclusively, since that's what it is a range of.  char[] gives you char's back when you index it.  Anyone who doesn't use ASCII will be confused by this.
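A short sketch of the mismatch being described, assuming a current Phobos where narrow strings are still auto-decoded by the range primitives: indexing yields a code unit, while front yields a decoded dchar.

```d
import std.range.primitives : front;
import std.stdio : writeln;

void main()
{
    string s = "\u00E9t\u00E9"; // "été": three characters, five UTF-8 code units

    // Indexing gives back a char (one code unit), the range API gives a dchar.
    static assert(is(typeof(s[0]) == immutable(char)));
    static assert(is(typeof(s.front) == dchar));

    writeln(s.length);         // 5
    writeln(cast(ubyte) s[0]); // 195, only the first byte of "é"
    writeln(s.front);          // é
}
```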

Also, I expect to be able to use a char[] as an array, which Phobos doesn't let me do in some cases (e.g. sorting an ASCII character array).

>
>> The disagreement will never be fully solved, as there is just as much
>> disagreement about the current state of affairs ;) e.g. should foreach
>> default to using dchar?
>
> I disagree about the disagreement being unsolvable. I'm not rigid; if I saw a terrific abstraction in your string, I'd be all for it. It just shuffles some issues about, and although I agree it does one thing or two better than char[], at the end of the day it doesn't carry its weight.

I see it as having two vast improvements:

1. If we replace char[] with a specific type for string, then char[] can be considered a true array by phobos, and phobos can now deal with a char[] array without the need to cast.
2. It protects the casual user from incorrectly using a string by making the default the correct API.

Those to me are very important.

>
>> I don't think I'll ever be 'happy' with the way strings sit in phobos
>> currently. I typically deal in ASCII (i.e. code units), and phobos works
>> very hard to prevent that.
>
> I wonder if we could and should extend some of the functions in std.string to work with ubyte[]. I did add a function called representation() that I didn't document yet. Essentially representation gives you the ubyte[], ushort[], or uint[] underneath a string, with the same qualifiers. Whenever you want an algorithm to work on ASCII in earnest, you can pass representation(s) to it instead of s.
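A minimal sketch of the representation() workaround for the ASCII sorting case mentioned earlier. It assumes the function as it exists in std.string today; sortAscii is a hypothetical helper name.

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical helper: sort an ASCII buffer in place by treating it as bytes.
void sortAscii(char[] buf)
{
    // representation() views the same memory as ubyte[], no copy is made.
    sort(buf.representation);
}

void main()
{
    char[] letters = "phobos".dup;

    // sort(letters) would not compile: char[] is not a random-access range.
    sortAscii(letters);
    writeln(letters); // bhoops
}
```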

This, again, fails on point 2 above.  A char[] is an array, and allows access to code-units, which is not the correct interface for a string.  Supporting ubyte[] doesn't fix that problem.  Correct as the default is usually a theme in D...

> If you work a lot with ASCII, an AsciiString abstraction may be a better and more likely to be successful string type. Better yet, you could simply focus on AsciiChar and then define ASCII strings as arrays of AsciiChar.

This seems like the wrong approach.  Adding a new type does not fix the problems with the original type.  We need to replace the original type or at least how it is treated by the compiler.

-Steve
January 13, 2011
On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
> I see it as having two vast improvements:
>
> 1. If we replace char[] with a specific type for string, then char[] can
> be considered a true array by phobos, and phobos can now deal with a
> char[] array without the need to cast.
> 2. It protects the casual user from incorrectly using a string by making
> the default the correct API.
>
> Those to me are very important.

Let's take a look:

// Incorrect string code
void fun(string s) {
  foreach (i; 0 .. s.length) {
    writeln("The character in position ", i, " is ", s[i]);
  }
}

// Incorrect string_t code
void fun(string_t!char s) {
  foreach (i; 0 .. s.codeUnits) {
    writeln("The character in position ", i, " is ", s[i]);
  }
}

Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).

But wait, there's less. Functions for random-access range throughout Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next to s[i]. From a cursory look at string_t, std.range will qualify it as a RandomAccessRange without length. That's an odd beast but does not change the fixed-length encoding assumption. So you'd need to special-case algorithms for string_t, just like right now certain algorithms are specialized for string.

Where's the progress?


Andrei
January 13, 2011
On 01/13/2011 02:47 PM, Michel Fortin wrote:
> On 2011-01-13 06:48:46 -0500, spir <denis.spir@gmail.com> said:
>
>> Note that D's stdlib currently provides no means to do this, not even
>> on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
>> library) (good luck ;-). But even ICU, as well as supposed
>> unicode-aware types or libraries for any language, would not give you an
>> abstraction producing correct results for Michel's example. For
>> instance, Python3 code fails as miserably as any other. AFAIK, D is
>> the first and only language having such a tool (Text.d at
>> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
>
> D is not the first language dealing correctly with Unicode strings in
> this manner. Objective-C's NSString class search and compare methods
> deal with characters with combining marks correctly. If you want to
> compare code points, you can do so explicitly using the NSLiteralSearch
> option, but the default is to compare the canonical version (at the
> grapheme level).
> <http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>

Thank you very much for this information (I feel less lonely ;-).
I'll have a look at this NSString class ASAP, looks like it does The-Right-Thing as default (an Apple product...)

> In Cocoa, string sorting and case-insensitive comparison is also dependent
> on the user's locale settings, although you can also specify your own
> locale if the user's locale is not what you want.

On this point, I'm more doubtful. (Locale settings do not guarantee anything about the right way of sorting for a given domain, a given app, a given use case. There is an infinity of potential choices. But maybe it's the right default? See KDE trying to invent a, hum, "natural", way of sorting file names...)

Denis
_________________
vita es estrany
spir.wikidot.com

January 13, 2011
On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
>> I see it as having two vast improvements:
>>
>> 1. If we replace char[] with a specific type for string, then char[] can
>> be considered a true array by phobos, and phobos can now deal with a
>> char[] array without the need to cast.
>> 2. It protects the casual user from incorrectly using a string by making
>> the default the correct API.
>>
>> Those to me are very important.
>
> Let's take a look:
>
> // Incorrect string code
> void fun(string s) {
>    foreach (i; 0 .. s.length) {
>      writeln("The character in position ", i, " is ", s[i]);
>    }
> }
>
> // Incorrect string_t code
> void fun(string_t!char s) {
>    foreach (i; 0 .. s.codeUnits) {
>      writeln("The character in position ", i, " is ", s[i]);
>    }
> }
>
> Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).

You might be looking at my previous version.  The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.

It also supports this:

foreach(i, d; s)
{
   writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

> But wait, there's less. Functions for random-access range throughout Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next to s[i]. From a cursory look at string_t, std.range will qualify it as a RandomAccessRange without length. That's an odd beast but does not change the fixed-length encoding assumption. So you'd need to special-case algorithms for string_t, just like right now certain algorithms are specialized for string.

isRandomAccessRange requires hasLength (see here: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).  This is not a random access range per that definition.  But a string isn't a random access range anyways (it's specifically disallowed by std.range per that same reference).
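The classification can be checked at compile time. This sketch assumes a current Phobos (std.range.primitives, with auto-decoding of narrow strings still in place); the linked 2011 source may differ in detail.

```d
import std.range.primitives : hasLength, isBidirectionalRange, isRandomAccessRange;

// A narrow string is a bidirectional range of dchar...
static assert(isBidirectionalRange!string);
// ...but the range traits deliberately deny it length and random access.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Fixed-width element types keep full array semantics.
static assert(isRandomAccessRange!(dchar[]));
static assert(isRandomAccessRange!(ubyte[]));

void main() {}
```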

The plan is you would *not* have to special case algorithms for string_t as you do currently for char[].  If that's not the case, then we haven't achieved much.  Simply put, we are separating out the strange nature of strings from arrays, so the exceptional treatment of them is handled by the type itself, not the functions using it.

-Steve
January 13, 2011
"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message news:mailman.604.1294932704.4748.digitalmars-d@puremagic.com...
> OT: Spir, do you know if I can change the syntax highlighting settings on bitbucket? I can't see anything with these gray on dark-gray colors: http://i.imgur.com/SmLk1.jpg

I'm getting the same problem too.


January 13, 2011
On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>> Let's take a look:
>>
>> // Incorrect string code
>> void fun(string s) {
>> foreach (i; 0 .. s.length) {
>> writeln("The character in position ", i, " is ", s[i]);
>> }
>> }
>>
>> // Incorrect string_t code
>> void fun(string_t!char s) {
>> foreach (i; 0 .. s.codeUnits) {
>> writeln("The character in position ", i, " is ", s[i]);
>> }
>> }
>>
>> Both functions are incorrect, albeit in different ways. The only
>> improvement I'm seeing is that the user needs to write codeUnits
>> instead of length, which may make her think twice. Clearly, however,
>> copiously incorrect code can be written with the proposed interface
>> because it tries to hide the reality that underneath a variable-length
>> encoding is being used, but doesn't hide it completely (albeit for
>> good efficiency-related reasons).
>
> You might be looking at my previous version. The new version (recently
> posted) will throw an exception for that code if a multi-code-unit
> code-point is found.

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

> It also supports this:
>
> foreach(i, d; s)
> {
> writeln("The character in position ", i, " is ", d);
> }
>
> where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to specify dchar.
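For reference, a small sketch of that built-in form. The explicit dchar is what makes foreach decode; codePointOffsets is a hypothetical helper used only to make the non-sequential indices visible.

```d
import std.stdio : writeln;

/// Hypothetical helper: collect the starting code-unit index of every
/// code point in s.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar d; s) // asking for dchar makes foreach decode UTF-8
        offsets ~= i;
    return offsets;
}

void main()
{
    // "é" occupies two code units, so the index jumps from 1 to 3.
    writeln(codePointOffsets("a\u00E9b")); // [0, 1, 3]
}
```

Leaving out dchar silently switches the loop back to code units, which is the nit in question.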

>> But wait, there's less. Functions for random-access range throughout
>> Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next
>> to s[i]. From a cursory look at string_t, std.range will qualify it as
>> a RandomAccessRange without length. That's an odd beast but does not
>> change the fixed-length encoding assumption. So you'd need to
>> special-case algorithms for string_t, just like right now certain
>> algorithms are specialized for string.
>
> isRandomAccessRange requires hasLength (see here:
> http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
> This is not a random access range per that definition.

That's an interesting twist. By the way, I specified that length is required because I couldn't imagine having random access into something that I can't tell the length of. Apparently I was wrong :o).

> But a string
> isn't a random access range anyways (it's specifically disallowed by
> std.range per that same reference).

It isn't and it isn't supposed to be.

> The plan is you would *not* have to special case algorithms for string_t
> as you do currently for char[]. If that's not the case, then we haven't
> achieved much. Simply put, we are separating out the strange nature of
> strings from arrays, so the exceptional treatment of them is handled by
> the type itself, not the functions using it.

That sounds reasonable.


Andrei
January 13, 2011
On Thu, 13 Jan 2011 15:51:00 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>> Let's take a look:
>>>
>>> // Incorrect string code
>>> void fun(string s) {
>>> foreach (i; 0 .. s.length) {
>>> writeln("The character in position ", i, " is ", s[i]);
>>> }
>>> }
>>>
>>> // Incorrect string_t code
>>> void fun(string_t!char s) {
>>> foreach (i; 0 .. s.codeUnits) {
>>> writeln("The character in position ", i, " is ", s[i]);
>>> }
>>> }
>>>
>>> Both functions are incorrect, albeit in different ways. The only
>>> improvement I'm seeing is that the user needs to write codeUnits
>>> instead of length, which may make her think twice. Clearly, however,
>>> copiously incorrect code can be written with the proposed interface
>>> because it tries to hide the reality that underneath a variable-length
>>> encoding is being used, but doesn't hide it completely (albeit for
>>> good efficiency-related reasons).
>>
>> You might be looking at my previous version. The new version (recently
>> posted) will throw an exception for that code if a multi-code-unit
>> code-point is found.
>
> I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

Iterating the code units is possible by accessing the array data, i.e. you could do:

foreach(i, c; s.data)

if you want the code-units.

That is the point of having a separate type.  Using string_t tells the library "I'm using this data as a string".  Using char[] tells the library "I'm using this data as an array."

The difference here is, you have to *specifically* try to access the code units, the default is code-points.  All it does really is switch the default.
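To make "switching the default" concrete, here is a deliberately tiny sketch. CodePointString is a hypothetical illustration and not the string_t posted in this thread; only the data member name is borrowed from the discussion above. It uses decode and stride from std.utf.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : decode, stride;

/// Hypothetical illustration only; this is not the string_t from the thread.
struct CodePointString
{
    string data; // raw UTF-8 code units, reached only by asking for them

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i); // decode the code point starting at index 0
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $]; // skip one whole code point
    }
}

void main()
{
    auto s = CodePointString("a\u00F1b"); // "añb"

    writeln(s.walkLength);  // 3: the default view is code points
    writeln(s.data.length); // 4: code units require naming .data
}
```

Generic range algorithms see only code points here; code units are still one member access away.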

>> It also supports this:
>>
>> foreach(i, d; s)
>> {
>> writeln("The character in position ", i, " is ", d);
>> }
>>
>> where i is the index (might not be sequential)
>
> Well string supports that too, albeit with the nit that you need to specify dchar.

This is not a small problem.

>> isRandomAccessRange requires hasLength (see here:
>> http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
>> This is not a random access range per that definition.
>
> That's an interesting twist. By the way, I specified that length is required because I couldn't imagine having random access into something that I can't tell the length of. Apparently I was wrong :o).

Yes, in fact, you could say that specifically defines VLERange ;)  But actually, there are two types of VLE ranges: those which can be randomly accessed (where determining the beginning of a code point, given a random index, is possible) and those that cannot (where decoding depends on the exact order of the data).  Actually, the latter would not be bi-directional ranges anyways.
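UTF-8 belongs to the first kind, because continuation bytes are recognizable on their own. A minimal sketch, assuming valid UTF-8 input and an in-bounds index; codePointStart is a hypothetical helper name.

```d
import std.stdio : writeln;

/// Hypothetical helper: index of the first code unit of the code point
/// that contains s[i]. Assumes s is valid UTF-8 and i < s.length.
size_t codePointStart(const(char)[] s, size_t i)
{
    // UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "a\u20ACb"; // "a€b": the euro sign takes three code units

    writeln(codePointStart(s, 3)); // 1, index 3 is inside the euro sign
    writeln(codePointStart(s, 4)); // 4, "b" starts its own code point
}
```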

>> But a string
>> isn't a random access range anyways (it's specifically disallowed by
>> std.range per that same reference).
>
> It isn't and it isn't supposed to be.

I agree with that assessment, which is why I omitted length.

-Steve