January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 2011-01-13 06:48:46 -0500, spir <denis.spir@gmail.com> said:

> Note that D's stdlib currently provides no means to do this, not even 
> on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode 
> library) (good luck ;-). But even ICU, as well as supposed 
> unicode-aware types or libraries for any language, would not give you an 
> abstraction producing correct results for Michel's example. For 
> instance, Python3 code fails as miserably as any other. AFAIK, D is the 
> first and only language having such a tool (Text.d at 
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

D is not the first language dealing correctly with Unicode strings in 
this manner. Objective-C's NSString class search and compare methods 
deal with characters with combining marks correctly. If you want to 
compare code points, you can do so explicitly using the NSLiteralSearch 
option, but the default is to compare the canonical version (at the 
grapheme level).
<http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>
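
For comparison, here is a minimal D sketch of the same literal-versus-canonical distinction. It assumes a current Phobos in which std.uni.normalize exists (it was not available when this was posted), and the helper name canonicallyEqual is made up for illustration. It is not how NSString is implemented.

```d
import std.uni : normalize, NFC;

/// Hypothetical helper: compares two strings after canonical
/// composition (NFC), so that precomposed and decomposed spellings
/// of the same text compare equal.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "\u00E9";   // é as the single code point U+00E9
    string decomposed  = "e\u0301";  // 'e' followed by U+0301 (combining acute)

    // A literal (code unit by code unit) comparison sees two different strings.
    assert(precomposed != decomposed);

    // A comparison of the canonical forms sees the same text.
    assert(canonicallyEqual(precomposed, decomposed));
}
```

Normalizing both operands on every comparison allocates, so this only shows the semantics, not an efficient implementation.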

In Cocoa, string sorting and case-insensitive comparison are also 
dependent on the user's locale settings, although you can also specify 
your own locale if the user's locale is not what you want.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 2011-01-13 07:10:09 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:

> However, regardless of what the best way to handle unicode is in 
> general, I think that it's painfully clear that your average programmer 
> doesn't know much about unicode. Even understanding the nuances between 
> char, wchar, and dchar is more than your average programmer seems to 
> understand at first. The idea that a char wouldn't be guaranteed to be 
> an actual character is not something that many programmers take to 
> immediately. It's quite foreign to how chars are typically dealt with 
> in other languages, and many programmers never worry about unicode at 
> all, only dealing with ASCII. So, not only is unicode a rather 
> disgusting problem, but it's not one that your average programmer 
> begins to grasp as far as I've seen. Unless the issue is abstracted 
> away completely, it takes a fair bit of explaining to understand how to 
> deal with unicode properly.

What's nice about Cocoa's way of handling strings is that even 
programmers not bothering about it get things right most of the time. 
Strings are compared in their canonical form (graphemes), unless you 
request a literal comparison; and they are sorted and compared 
case-insensitively according to the user's locale, unless you specify 
your own locale settings. Its only major pitfall is that indexing is 
done on UTF-16 code units.

The cost for this correctness is a small performance penalty, but I 
think it's the right path to take. For when performance or access to 
code points is important, the programmer should still be able to go 
down one layer and play with code points directly.

That said, we need to make sure the performance drop is minimal. I 
somewhat doubt that spir's approach of storing strings as an array 
of piles of characters is the right approach for most usage scenarios, 
but this area would need a little more research. spir's approach is 
certainly the ultimate step in correctness as it allows O(1) indexing 
of graphemes, but personally I'd rather not have indexing and just do 
on-the-fly decoding at the grapheme level when performing various 
string operations.
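
A rough sketch of what on-the-fly grapheme decoding looks like, assuming a current Phobos where std.uni.byGrapheme is available (it did not exist at the time of this post). The function name graphemeCount is hypothetical. Nothing is indexed or stored; the string is simply walked once.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts user-perceived characters by decoding
/// grapheme clusters lazily, without building an indexable array.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string s = "e\u0301a";  // 'e' + combining acute accent, then 'a'

    assert(s.length == 4);          // UTF-8 code units: 1 + 2 + 1
    assert(s.walkLength == 3);      // code points: e, U+0301, a
    assert(graphemeCount(s) == 2);  // graphemes: é, a
}
```

The trade-off is the one described above: each operation pays for decoding, but no extra storage is needed and O(1) grapheme indexing is given up.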

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
OT: Spir, do you know if I can change the syntax highlighting settings
on bitbucket? I can't see anything with these gray on dark-gray
colors: http://i.imgur.com/SmLk1.jpg
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Tue, 11 Jan 2011 18:00:30 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 1/11/11 11:21 AM, Steven Schveighoffer wrote:

>> It is supposed to be simple, and provide the expected interface, without
>> causing any undue performance degradation. That is, I should be able to
>> do all the things with a replacement string type that I can with a char
>> array today, as efficiently as I can today, except I should have to work
>> to get at the code-units. The huge benefit is that I can say "I'm
>> dealing with this as an array" when I know it's safe
>
> Unfinished sentence?

Sorry, I forgot '.' :)

> Anyway, for my money you just described what we have now.

All except the 'expected interface' part.  The string type should deal  
with dchars exclusively, since that's what it is a range of.  char[] gives  
you char's back when you index it.  Anyone who doesn't use ASCII will be  
confused by this.

Also, I expect to be able to use a char[] as an array, which Phobos  
doesn't let me in some cases (e.g. sorting ASCII character array).

>
>> The disagreement will never be fully solved, as there is just as much
>> disagreement about the current state of affairs ;) e.g. should foreach
>> default to using dchar?
>
> I disagree about the disagreement being unsolvable. I'm not rigid; if I  
> saw a terrific abstraction in your string, I'd be all for it. It just  
> shuffles some issues about, and although I agree it does one thing or  
> two better than char[], at the end of the day it doesn't carry its  
> weight.

I see it as having two vast improvements:

1. If we replace char[] with a specific type for string, then char[] can  
be considered a true array by phobos, and phobos can now deal with a  
char[] array without the need to cast.
2. It protects the casual user from incorrectly using a string by making  
the default the correct API.

Those to me are very important.

>
>> I don't think I'll ever be 'happy' with the way strings sit in phobos
>> currently. I typically deal in ASCII (i.e. code units), and phobos works
>> very hard to prevent that.
>
> I wonder if we could and should extend some of the functions in  
> std.string to work with ubyte[]. I did add a function called  
> representation() that I didn't document yet. Essentially representation  
> gives you the ubyte[], ushort[], or uint[] underneath a string, with the  
> same qualifiers. Whenever you want an algorithm to work on ASCII in  
> earnest, you can pass representation(s) to it instead of s.

This, again, fails on point 2 above.  A char[] is an array, and allows  
access to code-units, which is not the correct interface for a string.   
Supporting ubyte[] doesn't fix that problem.  Correct as the default is  
usually a theme in D...
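
For readers unfamiliar with it, this is roughly how the representation() approach from the quote is used, assuming the std.string.representation found in current Phobos. The wrapper name sortAscii is made up for illustration. It shows the workaround being discussed, not the proposed string type.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

/// Hypothetical helper: sorts an ASCII buffer in place by viewing it
/// as raw bytes, which Phobos treats as an ordinary random-access array.
void sortAscii(char[] buf)
{
    // representation() reinterprets char[] as ubyte[] without copying.
    sort(buf.representation);
}

unittest
{
    char[] buf = "dcba".dup;
    sortAscii(buf);
    assert(buf == "abcd");
}
```

The caller has to know to ask for the byte view, which is exactly the "correct by default" objection raised above.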

> If you work a lot with ASCII, an AsciiString abstraction may be a better  
> and more likely to be successful string type. Better yet, you could  
> simply focus on AsciiChar and then define ASCII strings as arrays of  
> AsciiChar.

This seems like the wrong approach.  Adding a new type does not fix the  
problems with the original type.  We need to replace the original type or  
at least how it is treated by the compiler.

-Steve
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
> I see it as having two vast improvements:
>
> 1. If we replace char[] with a specific type for string, then char[] can
> be considered a true array by phobos, and phobos can now deal with a
> char[] array without the need to cast.
> 2. It protects the casual user from incorrectly using a string by making
> the default the correct API.
>
> Those to me are very important.

Let's take a look:

// Incorrect string code
void fun(string s) {
  foreach (i; 0 .. s.length) {
    writeln("The character in position ", i, " is ", s[i]);
  }
}

// Incorrect string_t code
void fun(string_t!char s) {
  foreach (i; 0 .. s.codeUnits) {
    writeln("The character in position ", i, " is ", s[i]);
  }
}

Both functions are incorrect, albeit in different ways. The only 
improvement I'm seeing is that the user needs to write codeUnits instead 
of length, which may make her think twice. Clearly, however, copiously 
incorrect code can be written with the proposed interface because it 
tries to hide the reality that underneath a variable-length encoding is 
being used, but doesn't hide it completely (albeit for good 
efficiency-related reasons).
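
To make the failure of the first function concrete, here is a small sketch using only the built-in string type and std.range.walkLength. The helper names codeUnitCount and codePointCount are hypothetical.

```d
import std.range : walkLength;

/// Number of UTF-8 code units; this is what s.length reports.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Number of code points, found by decoding the whole string.
size_t codePointCount(string s)
{
    return s.walkLength;
}

unittest
{
    string s = "h\u00E9";  // 'h' followed by é (U+00E9)

    assert(codeUnitCount(s) == 3);   // é occupies two code units in UTF-8
    assert(codePointCount(s) == 2);
    assert(s[1] == 0xC3);            // s[1] is only the lead byte of é
}
```

A loop over 0 .. s.length therefore visits three "positions" and prints two of them as fragments of a single character.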

But wait, there's less. Functions for random-access range throughout 
Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next 
to s[i]. From a cursory look at string_t, std.range will qualify it as a 
RandomAccessRange without length. That's an odd beast but does not 
change the fixed-length encoding assumption. So you'd need to 
special-case algorithms for string_t, just like right now certain 
algorithms are specialized for string.

Where's the progress?


Andrei
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 01/13/2011 02:47 PM, Michel Fortin wrote:
> On 2011-01-13 06:48:46 -0500, spir <denis.spir@gmail.com> said:
>
>> Note that D's stdlib currently provides no means to do this, not even
>> on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
>> library) (good luck ;-). But even ICU, as well as supposed
>> unicode-aware types or libraries for any language, would not give you an
>> abstraction producing correct results for Michel's example. For
>> instance, Python3 code fails as miserably as any other. AFAIK, D is
>> the first and only language having such a tool (Text.d at
>> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
>
> D is not the first language dealing correctly with Unicode strings in
> this manner. Objective-C's NSString class search and compare methods
> deal with characters with combining marks correctly. If you want to
> compare code points, you can do so explicitly using the NSLiteralSearch
> option, but the default is to compare the canonical version (at the
> grapheme level).
> <http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>

Thank you very much for this information (I feel less lonely ;-).
I'll have a look at this NSString class ASAP, looks like it does 
The-Right-Thing as default (an Apple product...)

> In Cocoa, string sorting and case-insensitive comparison are also dependent
> on the user's locale settings, although you can also specify your own
> locale if the user's locale is not what you want.

On this point, I'm more doubtful. (Locale settings do not guarantee 
anything about the right way of sorting for a given domain, a given app, 
or a given use case. There is an infinity of potential choices. But maybe 
it's the right default? See KDE trying to invent a, hum, "natural" way of 
sorting file names...)

Denis
_________________
vita es estrany
spir.wikidot.com
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
>> I see it as having two vast improvements:
>>
>> 1. If we replace char[] with a specific type for string, then char[] can
>> be considered a true array by phobos, and phobos can now deal with a
>> char[] array without the need to cast.
>> 2. It protects the casual user from incorrectly using a string by making
>> the default the correct API.
>>
>> Those to me are very important.
>
> Let's take a look:
>
> // Incorrect string code
> void fun(string s) {
>    foreach (i; 0 .. s.length) {
>      writeln("The character in position ", i, " is ", s[i]);
>    }
> }
>
> // Incorrect string_t code
> void fun(string_t!char s) {
>    foreach (i; 0 .. s.codeUnits) {
>      writeln("The character in position ", i, " is ", s[i]);
>    }
> }
>
> Both functions are incorrect, albeit in different ways. The only  
> improvement I'm seeing is that the user needs to write codeUnits instead  
> of length, which may make her think twice. Clearly, however, copiously  
> incorrect code can be written with the proposed interface because it  
> tries to hide the reality that underneath a variable-length encoding is  
> being used, but doesn't hide it completely (albeit for good  
> efficiency-related reasons).

You might be looking at my previous version.  The new version (recently  
posted) will throw an exception for that code if a multi-code-unit  
code-point is found.

It also supports this:

foreach(i, d; s)
{
   writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

> But wait, there's less. Functions for random-access range throughout  
> Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next  
> to s[i]. From a cursory look at string_t, std.range will qualify it as a  
> RandomAccessRange without length. That's an odd beast but does not  
> change the fixed-length encoding assumption. So you'd need to  
> special-case algorithms for string_t, just like right now certain  
> algorithms are specialized for string.

isRandomAccessRange requires hasLength (see here:  
http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).   
This is not a random access range per that definition.  But a string isn't  
a random access range anyways (it's specifically disallowed by std.range  
per that same reference).
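
The classification can be checked at compile time. This sketch assumes the traits as they exist in a current Phobos (std.range.primitives), which may differ in detail from the 2011 source linked above.

```d
import std.range.primitives : hasLength, isBidirectionalRange,
    isRandomAccessRange;

// A narrow string is a bidirectional range of code points...
static assert(isBidirectionalRange!string);

// ...but Phobos deliberately hides its length and random access,
// because both would be expressed in code units.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Arrays of bytes or of dchar have fixed-width elements, so they
// qualify as random-access ranges.
static assert(isRandomAccessRange!(ubyte[]));
static assert(isRandomAccessRange!(dchar[]));
```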

The plan is you would *not* have to special case algorithms for string_t  
as you do currently for char[].  If that's not the case, then we haven't  
achieved much.  Simply put, we are separating out the strange nature of  
strings from arrays, so the exceptional treatment of them is handled by  
the type itself, not the functions using it.

-Steve
January 13, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message 
news:mailman.604.1294932704.4748.digitalmars-d@puremagic.com...
> OT: Spir, do you know if I can change the syntax highlighting settings
> on bitbucket? I can't see anything with these gray on dark-gray
> colors: http://i.imgur.com/SmLk1.jpg

I'm getting the same problem too.
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>> Let's take a look:
>>
>> // Incorrect string code
>> void fun(string s) {
>>   foreach (i; 0 .. s.length) {
>>     writeln("The character in position ", i, " is ", s[i]);
>>   }
>> }
>>
>> // Incorrect string_t code
>> void fun(string_t!char s) {
>>   foreach (i; 0 .. s.codeUnits) {
>>     writeln("The character in position ", i, " is ", s[i]);
>>   }
>> }
>>
>> Both functions are incorrect, albeit in different ways. The only
>> improvement I'm seeing is that the user needs to write codeUnits
>> instead of length, which may make her think twice. Clearly, however,
>> copiously incorrect code can be written with the proposed interface
>> because it tries to hide the reality that underneath a variable-length
>> encoding is being used, but doesn't hide it completely (albeit for
>> good efficiency-related reasons).
>
> You might be looking at my previous version. The new version (recently
> posted) will throw an exception for that code if a multi-code-unit
> code-point is found.

I was looking at your latest. It's code that compiles and runs, but 
dynamically fails on some inputs. I agree that it's often better to fail 
noisily instead of silently, but in a manner of speaking the 
string-based code doesn't fail at all - it correctly iterates the code 
units of a string. This may sometimes not be what the user expected; 
most of the time they'd care about the code points.

> It also supports this:
>
> foreach(i, d; s)
> {
>    writeln("The character in position ", i, " is ", d);
> }
>
> where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to 
specify dchar.
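
A short sketch of that built-in behavior, using only the language's foreach over a string. The function name codePointOffsets is hypothetical. With an explicit dchar loop variable, the index is the code-unit offset at which each code point starts, so it is not necessarily sequential.

```d
/// Hypothetical helper: collects the code-unit offset at which each
/// code point of s begins.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // Declaring the element as dchar makes foreach decode UTF-8;
    // without it, the loop would iterate individual code units.
    foreach (i, dchar d; s)
        offsets ~= i;
    return offsets;
}

unittest
{
    // 'a' is 1 code unit, é (U+00E9) is 2, 'b' is 1.
    size_t[] expected = [0, 1, 3];
    assert(codePointOffsets("a\u00E9b") == expected);
}
```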

>> But wait, there's less. Functions for random-access range throughout
>> Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next
>> to s[i]. From a cursory look at string_t, std.range will qualify it as
>> a RandomAccessRange without length. That's an odd beast but does not
>> change the fixed-length encoding assumption. So you'd need to
>> special-case algorithms for string_t, just like right now certain
>> algorithms are specialized for string.
>
> isRandomAccessRange requires hasLength (see here:
> http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
> This is not a random access range per that definition.

That's an interesting twist. By the way, I specified that length is 
required back then because I couldn't imagine having random access into 
something that I can't tell the length of. Apparently I was wrong :o).

> But a string
> isn't a random access range anyways (it's specifically disallowed by
> std.range per that same reference).

It isn't and it isn't supposed to be.

> The plan is you would *not* have to special case algorithms for string_t
> as you do currently for char[]. If that's not the case, then we haven't
> achieved much. Simply put, we are separating out the strange nature of
> strings from arrays, so the exceptional treatment of them is handled by
> the type itself, not the functions using it.

That sounds reasonable.


Andrei
January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Thu, 13 Jan 2011 15:51:00 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>> Let's take a look:
>>>
>>> // Incorrect string code
>>> void fun(string s) {
>>>   foreach (i; 0 .. s.length) {
>>>     writeln("The character in position ", i, " is ", s[i]);
>>>   }
>>> }
>>>
>>> // Incorrect string_t code
>>> void fun(string_t!char s) {
>>>   foreach (i; 0 .. s.codeUnits) {
>>>     writeln("The character in position ", i, " is ", s[i]);
>>>   }
>>> }
>>>
>>> Both functions are incorrect, albeit in different ways. The only
>>> improvement I'm seeing is that the user needs to write codeUnits
>>> instead of length, which may make her think twice. Clearly, however,
>>> copiously incorrect code can be written with the proposed interface
>>> because it tries to hide the reality that underneath a variable-length
>>> encoding is being used, but doesn't hide it completely (albeit for
>>> good efficiency-related reasons).
>>
>> You might be looking at my previous version. The new version (recently
>> posted) will throw an exception for that code if a multi-code-unit
>> code-point is found.
>
> I was looking at your latest. It's code that compiles and runs, but  
> dynamically fails on some inputs. I agree that it's often better to fail  
> noisily instead of silently, but in a manner of speaking the  
> string-based code doesn't fail at all - it correctly iterates the code  
> units of a string. This may sometimes not be what the user expected;  
> most of the time they'd care about the code points.

Iterating the code units is possible by accessing the array data, i.e.  
you could do:

foreach(i, c; s.data)

if you want the code-units.

That is the point of having a separate type.  Using string_t tells the  
library "I'm using this data as a string".  Using char[] tells the library  
"I'm using this data as an array."

The difference here is, you have to *specifically* try to access the code  
units, the default is code-points.  All it does really is switch the  
default.
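
A stripped-down sketch of the "switch the default" idea. This is not the actual string_t implementation; CodePointString and the two counting helpers are hypothetical names, and only iteration is modeled.

```d
/// Hypothetical wrapper: iterating it yields code points by default,
/// while the code units stay reachable through the data member.
struct CodePointString
{
    string data;  // raw UTF-8 code units, used only when asked for

    /// Default iteration: (code-unit offset, decoded code point).
    int opApply(scope int delegate(size_t, dchar) dg)
    {
        foreach (i, dchar d; data)
        {
            if (auto result = dg(i, d))
                return result;
        }
        return 0;
    }
}

/// Counts elements seen by default iteration (code points).
size_t countDefault(CodePointString s)
{
    size_t n;
    foreach (i, d; s)
        ++n;
    return n;
}

/// Counts elements seen when explicitly iterating the data (code units).
size_t countData(CodePointString s)
{
    size_t n;
    foreach (c; s.data)
        ++n;
    return n;
}

unittest
{
    auto s = CodePointString("a\u00E9");  // 'a' plus é (two code units)
    assert(countDefault(s) == 2);
    assert(countData(s) == 3);
}
```

The caller never writes dchar; choosing the wrapper type is what selects code-point iteration.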

>> It also supports this:
>>
>> foreach(i, d; s)
>> {
>>    writeln("The character in position ", i, " is ", d);
>> }
>>
>> where i is the index (might not be sequential)
>
> Well string supports that too, albeit with the nit that you need to  
> specify dchar.

This is not a small problem.

>> isRandomAccessRange requires hasLength (see here:
>> http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
>> This is not a random access range per that definition.
>
> That's an interesting twist. By the way I specified length is required  
> then because I couldn't imagine having random access into something that  
> I can't tell the length of. Apparently I was wrong :o).

Yes, in fact, you could say that specifically defines VLERange ;)  But  
actually, there are two types of VLE ranges, those which can be randomly  
accessed (where determining the beginning of a code point, given a random  
index is possible) and those that cannot (where decoding depends on the  
exact order of the data).  Actually, those would not be bi-directional  
ranges anyways.
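
UTF-8 is an example of the first kind: continuation bytes always have the bit pattern 10xxxxxx, so the start of a code point can be found from any index. A minimal sketch, assuming valid UTF-8 input and an in-bounds index; the function name codePointStart is made up for illustration.

```d
/// Hypothetical helper: returns the index of the first code unit of
/// the code point that contains index i.
/// Assumes s is valid UTF-8 and i < s.length.
size_t codePointStart(const(char)[] s, size_t i)
{
    // Step back over continuation bytes (10xxxxxx) to the lead byte.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

unittest
{
    string s = "a\u00E9b";  // bytes: 0x61, 0xC3, 0xA9, 0x62

    assert(codePointStart(s, 0) == 0);  // 'a'
    assert(codePointStart(s, 1) == 1);  // lead byte of é
    assert(codePointStart(s, 2) == 1);  // continuation byte of é
    assert(codePointStart(s, 3) == 3);  // 'b'
}
```

Because a UTF-8 sequence is at most four code units long, the loop backs up at most three steps on valid input.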

>> But a string
>> isn't a random access range anyways (it's specifically disallowed by
>> std.range per that same reference).
>
> It isn't and it isn't supposed to be.

I agree with that assessment, which is why I omitted length.

-Steve