April 23, 2012
I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.
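
A minimal sketch of the difference, using sample strings that are assumptions rather than anything from the tutorial itself. It can be checked by compiling with unittests enabled:

```d
import std.range : walkLength;

unittest
{
    string ascii = "hello";
    string accented = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // For pure ASCII the two counts agree, which is why bugs can lurk.
    assert(ascii.length == 5);
    assert(ascii.walkLength == 5);

    // With a non-ASCII code point they diverge.
    assert(accented.length == 6);     // code units (bytes for UTF-8)
    assert(accented.walkLength == 5); // code points

    // Casting exposes the raw bytes, as suggested above.
    auto bytes = cast(immutable(ubyte)[]) accented;
    assert(bytes.length == 6);
    assert(bytes[1] == 0xC3 && bytes[2] == 0xA9); // UTF-8 encoding of U+00E9
}
```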
April 23, 2012
On Monday, 23 April 2012 at 23:01:59 UTC, James Miller wrote:
> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?


Maybe... but it is important that this works:

string s;

if(s.length)
   do_something(s);

since that's always right and quite common.
April 23, 2012
James Miller:

> I realised that when you want the number of characters, you normally actually want to use walkLength, not length.

As with strlen() in C, unfortunately the result of walkLength(somestring) is computed every time you call it, because it doesn't get cached.
A partial improvement for this situation is to ensure that walkLength(somestring) is strongly pure, and that the D compiler is able to hoist this invariant pure computation out of loops.


> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

This is not easy to do, because sometimes you want to know the number of code points, and sometimes of code units.
I remember even a proposal to rename the "length" field to another name for narrow strings, to avoid such bugs.

-----------------------

Adam D. Ruppe:

> Maybe... but it is important that this works:
>
> string s;
>
> if(s.length)
>    do_something(s);
>
> since that's always right and quite common.

Better:

if (!s.empty)
    do_something(s);

(or even better, built-in non-nulls, usable for strings too).
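
A small sketch of why either emptiness check is safe regardless of encoding: a string has zero code points exactly when it has zero code units. The helper name `hasContent` is hypothetical and only stands in for the `do_something` guard above.

```d
import std.range : empty;

// Hypothetical helper standing in for the guard around do_something.
bool hasContent(string s)
{
    return !s.empty; // for arrays this is the same test as s.length != 0
}

unittest
{
    string unset; // default-initialised (null) string
    assert(unset.length == 0);
    assert(unset.empty);
    assert(!hasContent(unset));
    assert(!hasContent(""));

    // One code point, two UTF-8 code units: still simply "not empty".
    assert(hasContent("\u00E9"));
}
```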

Bye,
bearophile
April 24, 2012
On Monday, 23 April 2012 at 23:52:41 UTC, bearophile wrote:
> James Miller:
>
>> I realised that when you want the number of characters, you normally actually want to use walkLength, not length.
>
> As with strlen() in C, unfortunately the result of walkLength(somestring) is computed every time you call it, because it doesn't get cached.
> A partial improvement for this situation is to ensure that walkLength(somestring) is strongly pure, and that the D compiler is able to hoist this invariant pure computation out of loops.
>
>
>> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> This is not easy to do, because sometimes you want to know the number of code points, and sometimes of code units.
> I remember even a proposal to rename the "length" field to another name for narrow strings, to avoid such bugs.

I was thinking about that. This is quite a vague suggestion, more just throwing the idea out there and seeing what people think. I am aware of the issue of walkLength being computed every time, rather than being a constant lookup. One option would be to make it only a warning in @safe code, so worst case scenario is that you mark the function as @trusted. I feel this fits in with the idea of @safe quite well, since you have to explicitly tell the compiler that you know what you're doing.

Another option would be to have some sort of general lint tool that picks up on these kinds of potential errors, though that is a lot bigger scope...

--
James Miller
April 24, 2012
James Miller:

> Another option would be to have some sort of general lint tool that picks up on these kinds of potential errors, though that is a lot bigger scope...

A lot of people in D.learn don't even use "-wi -property", so go figure how many will use a lint tool :-)

As a first approximation, you can rely only on what people see when compiling with a plain "dmd foo.d", that is, the most basic compiler invocation. More serious programmers thankfully activate warnings.

Bye,
bearophile
April 24, 2012
On Tuesday, April 24, 2012 01:01:57 James Miller wrote:
> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
> 
> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.

At this point, I don't think that it makes any sense to give a warning for this. The compiler can't possibly know whether using length is a good idea or correct in any particular set of code. If we really want to do something to tackle the problem, then we should create a new string type which better solves the issues. There's a _lot_ more to be worried about due to the fact that strings are variable length encoded than just their length.

There has been talk of creating a new string type, and there has been talk of creating the concept of a variable length encoded range which better handles all of this stuff, but no proposal thus far has gotten anywhere.

As for walkLength being O(n) in many cases (as discussed elsewhere in this thread), I don't think that it's that big a deal. If you know what it's doing, you know that it's O(n), and it's simple enough to simply save the result if you need to call it multiple times.
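
A sketch of what saving the result can look like. The function `countSameLength` and its inputs are hypothetical, made up purely for illustration:

```d
import std.range : walkLength;

// Hypothetical helper: counts the entries of `words` that have exactly
// as many code points as `reference`.
size_t countSameLength(string reference, string[] words)
{
    // O(n) walk over `reference`, done once and kept outside the loop.
    immutable target = reference.walkLength;

    size_t matches;
    foreach (w; words)
    {
        if (w.walkLength == target)
            ++matches;
    }
    return matches;
}

unittest
{
    // "h\u00E9llo", "world" and "na\u00EFve" have 5 code points each; "abc" has 3.
    assert(countSameLength("h\u00E9llo", ["world", "abc", "na\u00EFve"]) == 2);
}
```

Each element of `words` still has to be walked once, but the walk over `reference` is no longer repeated on every iteration.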

- Jonathan M Davis
April 26, 2012
"James Miller" <james@aatch.net> wrote in message news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.

I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases.

Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").


April 26, 2012
On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
> the number of "characters" (ie, graphemes), but merely the number of code
> points - which is not the same thing (due to existence of the
> [confusingly-named] "combining characters").

You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's internals) deals with graphemes. It all operates on code points, and strings are considered to be ranges of code points, not graphemes. So, as far as ranges go, walkLength returns the actual length of the range. That's _usually_ the number of characters/graphemes as well, but it's certainly not 100% correct. We'll need further unicode facilities in Phobos to deal with that though, and I doubt that strings will ever change to be treated as ranges of graphemes, since that would be incredibly expensive computationally. We have enough performance problems with strings as it is. What we'll probably get is extra functions to deal with normalization (and probably something to count the number of graphemes) and probably a wrapper type that does deal in graphemes.

Regardless, you're right about walkLength returning the number of code points rather than graphemes, because strings are considered to be ranges of dchar.
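
To make the three counts concrete, here is a small sketch. It assumes a later Phobos release than the one discussed in this thread, one in which std.uni provides byGrapheme; the sample string is also an assumption.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points
    assert(s.byGrapheme.walkLength == 1); // grapheme clusters
}
```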

- Jonathan M Davis
April 26, 2012
On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> "James Miller" <james@aatch.net> wrote in message news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> > I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
> >
> > It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.
> 
> I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases.
> 
> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
> return the number of "characters" (ie, graphemes), but merely the
> number of code points - which is not the same thing (due to existence
> of the [confusingly-named] "combining characters").
[...]

And don't forget that some code points (notably from the CJK block) are
specified as "double-width", so if you're trying to do text layout,
you'll want yet a different length (layoutLength?).

So we really need all four lengths. Ain't unicode fun?! :-)

Array length is simple. walkLength is already implemented. Grapheme length requires recognition of 'combining characters' (or rather, ignoring said characters), and layout length requires recognizing widthless, single- and double-width characters.
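
A heavily simplified sketch of what a layout length could look like. The name `layoutLength` is only the one floated above, not an existing Phobos function, and the two hard-coded ranges are an assumption for illustration; real code would consult the full Unicode width data.

```d
// Hypothetical sketch: only two ranges are handled, everything else is
// treated as single-width.
size_t layoutLength(string s)
{
    size_t width;
    foreach (dchar c; s) // foreach with dchar decodes into code points
    {
        if (c >= 0x0300 && c <= 0x036F)
        {
            // Combining Diacritical Marks block: occupies no column.
        }
        else if (c >= 0x4E00 && c <= 0x9FFF)
        {
            width += 2; // CJK Unified Ideographs block: double width
        }
        else
        {
            width += 1;
        }
    }
    return width;
}

unittest
{
    assert(layoutLength("abc") == 3);
    assert(layoutLength("e\u0301") == 1);      // base letter plus combining mark
    assert(layoutLength("\u65E5\u672C") == 4); // two CJK ideographs
}
```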

I've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all?

What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs), codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding.

The generation algo can then be run just once per unicode release, and everything will Just Work.
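
This is not the generated FSM described above, only the simplest hand-written instance of the same idea: computing a length directly on UTF-8 code units without decoding to dchar. The helper name `countCodePoints` is hypothetical, and the sketch assumes the input is valid UTF-8.

```d
import std.range : walkLength;

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes. Assumes `s` is valid UTF-8.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (char c; s) // iterates code units, no decoding
    {
        if ((c & 0xC0) != 0x80) // continuation bytes look like 10xxxxxx
            ++n;
    }
    return n;
}

unittest
{
    foreach (s; ["", "hello", "h\u00E9llo", "\u65E5\u672C", "e\u0301"])
        assert(countCodePoints(s) == s.walkLength);
}
```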


T

-- 
Give me some fresh salted fish, please.
April 26, 2012
"Jonathan M Davis" <jmdavisProg@gmx.com> wrote in message news:mailman.2166.1335463456.4860.digitalmars-d@puremagic.com...
> On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
>> the number of "characters" (ie, graphemes), but merely the number of code
>> points - which is not the same thing (due to existence of the
>> [confusingly-named] "combining characters").
>
> You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's internals) deals with graphemes. It all operates on code points, and strings are considered to be ranges of code points, not graphemes. So, as far as ranges go, walkLength returns the actual length of the range. That's _usually_ the number of characters/graphemes as well, but it's certainly not 100% correct. We'll need further unicode facilities in Phobos to deal with that though, and I doubt that strings will ever change to be treated as ranges of graphemes, since that would be incredibly expensive computationally. We have enough performance problems with strings as it is. What we'll probably get is extra functions to deal with normalization (and probably something to count the number of graphemes) and probably a wrapper type that does deal in graphemes.
>

Yea, I'm not saying that walkLength should deal with graphemes. Just that if someone wants the number of "characters", then neither length *nor* walkLength is guaranteed to be correct.

