April 23, 2012
Notice/Warning on narrowStrings .length
I'm writing an introduction/tutorial to using strings in D, 
paying particular attention to the complexities of UTF-8 and 16. 
I realised that when you want the number of characters, you 
normally actually want to use walkLength, not length. Is it 
reasonable for the compiler to pick this up during semantic 
analysis and point out this situation?

It's just a thought because a lot of the time, using length will 
get the right answer, but for the wrong reasons, resulting in 
lurking bugs. You can always cast to immutable(ubyte)[] or 
immutable(short)[] if you want to work with the actual bytes 
anyway.
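
To make the difference concrete, here is a minimal sketch of how length and walkLength diverge on narrow strings. It assumes a reasonably recent Phobos in which strings are iterated as ranges of code points; the variable names are only illustrative. It also uses std.string.representation, which (as an alternative to the casts mentioned above) exposes the raw code units as an unsigned integer array.

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    string ascii = "hello";
    string accented = "h\u00E9llo"; // U+00E9 needs two UTF-8 code units

    // For pure ASCII both counts agree, which is how the bug stays hidden.
    assert(ascii.length == 5);
    assert(ascii.walkLength == 5);

    // With non-ASCII text they diverge.
    assert(accented.length == 6);     // UTF-8 code units
    assert(accented.walkLength == 5); // code points

    // UTF-16 has the same problem outside the Basic Multilingual Plane.
    wstring emoji = "\U0001F600"w;    // one code point, one surrogate pair
    assert(emoji.length == 2);        // UTF-16 code units
    assert(emoji.walkLength == 1);    // code points

    // Raw code units without an explicit cast.
    immutable(ubyte)[] bytes = accented.representation;
    assert(bytes.length == 6);
}
```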
April 23, 2012
Re: Notice/Warning on narrowStrings .length
On Monday, 23 April 2012 at 23:01:59 UTC, James Miller wrote:
> Is it reasonable for the compiler to pick this up during 
> semantic analysis and point out this situation?


Maybe... but it is important that this works:

string s;

if(s.length)
   do_something(s);

since that's always right and quite common.
April 23, 2012
Re: Notice/Warning on narrowStrings .length
James Miller:

> I realised that when you want the number of characters, you 
> normally actually want to use walkLength, not length.

As with strlen() in C, unfortunately the result of 
walkLength(somestring) is computed every time you call it... 
because it doesn't get cached.
A partial improvement for this situation is to ensure that 
walkLength(somestring) is strongly pure, and that the D 
compiler is able to move this invariant pure computation out of 
loops.


> Is it reasonable for the compiler to pick this up during 
> semantic analysis and point out this situation?

This is not easy to do, because sometimes you want to know the 
number of code points, and sometimes of code units.
I remember even a proposal to rename the "length" field to 
another name for narrow strings, to avoid such bugs.

-----------------------

Adam D. Ruppe:

> Maybe... but it is important that this works:
>
> string s;
>
> if(s.length)
>    do_something(s);
>
> since that's always right and quite common.

Better:

if (!s.empty)
    do_something(s);

(or even better, built-in non-nulls, usable for strings too).
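
As a small sketch of why the emptiness check is safe either way: a string has zero code units exactly when it has zero code points, so both s.length and s.empty give the right answer here. The helper name hasContent is hypothetical and exists only for this illustration.

```d
import std.range : empty;

// Hypothetical helper: true if the string holds at least one code unit.
bool hasContent(string s)
{
    return !s.empty;
}

void main()
{
    string s; // default-initialized, so it is null and has length 0
    assert(s.length == 0);
    assert(s.empty);
    assert(!hasContent(s));
    assert(!hasContent(""));

    // A single non-ASCII code point: two code units, but still "not empty".
    assert("\u00E9".length == 2);
    assert(hasContent("\u00E9"));
}
```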

Bye,
bearophile
April 24, 2012
Re: Notice/Warning on narrowStrings .length
On Monday, 23 April 2012 at 23:52:41 UTC, bearophile wrote:
> James Miller:
>
>> I realised that when you want the number of characters, you 
>> normally actually want to use walkLength, not length.
>
> As with strlen() in C, unfortunately the result of 
> walkLength(somestring) is computed every time you call it... 
> because it doesn't get cached.
> A partial improvement for this situation is to ensure that 
> walkLength(somestring) is strongly pure, and that the D 
> compiler is able to move this invariant pure computation out of 
> loops.
>
>
>> Is it reasonable for the compiler to pick this up during 
>> semantic analysis and point out this situation?
>
> This is not easy to do, because sometimes you want to know the 
> number of code points, and sometimes of code units.
> I remember even a proposal to rename the "length" field to 
> another name for narrow strings, to avoid such bugs.

I was thinking about that. This is quite a vague suggestion, more 
just throwing the idea out there and seeing what people think. I 
am aware of the issue of walkLength being computed every time, 
rather than being a constant lookup. One option would be to make 
it only a warning in @safe code, so worst case scenario is that 
you mark the function as @trusted. I feel this fits in with the 
idea of @safe quite well, since you have to explicitly tell the 
compiler that you know what you're doing.

Another option would be to have some sort of general lint tool 
that picks up on these kinds of potential errors, though that is 
a lot bigger scope...

--
James Miller
April 24, 2012
Re: Notice/Warning on narrowStrings .length
James Miller:

> Another option would be to have some sort of general lint tool 
> that picks up on these kinds of potential errors, though that 
> is a lot bigger scope...

A lot of people in D.learn don't even use "-wi -property", so go 
figure how many will use a lint :-)

As a first approximation, you can rely only on what people see 
when compiling with plain "dmd foo.d", the most basic way to 
invoke the compiler. More serious programmers thankfully activate 
warnings.

Bye,
bearophile
April 24, 2012
Re: Notice/Warning on narrowStrings .length
On Tuesday, April 24, 2012 01:01:57 James Miller wrote:
> I'm writing an introduction/tutorial to using strings in D,
> paying particular attention to the complexities of UTF-8 and 16.
> I realised that when you want the number of characters, you
> normally actually want to use walkLength, not length. Is it
> reasonable for the compiler to pick this up during semantic
> analysis and point out this situation?
> 
> It's just a thought because a lot of the time, using length will
> get the right answer, but for the wrong reasons, resulting in
> lurking bugs. You can always cast to immutable(ubyte)[] or
> immutable(short)[] if you want to work with the actual bytes
> anyway.

At this point, I don't think that it makes any sense to give a warning for 
this. The compiler can't possibly know whether using length is a good idea or 
correct in any particular set of code. If we really want to do something to 
tackle the problem, then we should create a new string type which better 
solves the issues. There's a _lot_ more to be worried about due to the fact 
that strings are variable length encoded than just their length.

There has been talk of creating a new string type, and there has been talk of 
creating the concept of a variable length encoded range which better handles 
all of this stuff, but no proposal thus far has gotten anywhere.

As for walkLength being O(n) in many cases (as discussed elsewhere in this 
thread), I don't think that it's that big a deal. If you know what it's doing, 
you know that it's O(n), and it's simple enough to simply save the result if 
you need to call it multiple times.
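
A minimal sketch of "save the result" follows. The function countShorterThan is a hypothetical example, not anything from Phobos: it walks the reference string once before the loop instead of once per iteration.

```d
import std.range : walkLength;

// Hypothetical helper: counts the words that have fewer code points
// than `reference`.
size_t countShorterThan(string[] words, string reference)
{
    // O(n) walk done once and stored, rather than repeated in the loop.
    immutable refLen = reference.walkLength;

    size_t count;
    foreach (w; words)
    {
        if (w.walkLength < refLen)
            ++count;
    }
    return count;
}

void main()
{
    // "\u00E9\u00E9" is 2 code points but 4 UTF-8 code units, so comparing
    // with .length instead would wrongly exclude it.
    auto words = ["a", "\u00E9\u00E9", "abcd"];
    assert(countShorterThan(words, "abc") == 2);
}
```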

- Jonathan M Davis
April 26, 2012
Re: Notice/Warning on narrowStrings .length
"James Miller" <james@aatch.net> wrote in message 
news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> I'm writing an introduction/tutorial to using strings in D, paying 
> particular attention to the complexities of UTF-8 and 16. I realised that 
> when you want the number of characters, you normally actually want to use 
> walkLength, not length. Is it reasonable for the compiler to pick this up 
> during semantic analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will get the 
> right answer, but for the wrong reasons, resulting in lurking bugs. You 
> can always cast to immutable(ubyte)[] or immutable(short)[] if you want to 
> work with the actual bytes anyway.

I find that most of the time I actually *do* want to use length. Don't know 
if that's common, though, or if it's just a reflection of my particular 
use-cases.

Also, keep in mind that (unless I'm mistaken) walkLength does *not* return 
the number of "characters" (ie, graphemes), but merely the number of code 
points - which is not the same thing (due to existence of the 
[confusingly-named] "combining characters").
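
The three counts can be compared side by side in a short sketch. It relies on std.uni.byGrapheme, which to my knowledge was added to Phobos after this thread was written, so treat it as an assumption about the reader's toolchain rather than something available at the time.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" spelled with 'e' followed by U+0308 COMBINING DIAERESIS.
    string text = "noe\u0308l";

    assert(text.length == 6);                // UTF-8 code units
    assert(text.walkLength == 5);            // code points
    assert(text.byGrapheme.walkLength == 4); // graphemes ("characters")
}
```

Neither length nor walkLength matches what a reader would call the number of characters here; only the grapheme count does.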
April 26, 2012
Re: Notice/Warning on narrowStrings .length
On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
> the number of "characters" (ie, graphemes), but merely the number of code
> points - which is not the same thing (due to existence of the
> [confusingly-named] "combining characters").

You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's 
internals) deals with graphemes. It all operates on code points, and strings 
are considered to be ranges of code points, not graphemes. So, as far as 
ranges go, walkLength returns the actual length of the range. That's _usually_ 
the number of characters/graphemes as well, but it's certainly not 100% 
correct. We'll need further unicode facilities in Phobos to deal with that 
though, and I doubt that strings will ever change to be treated as ranges of 
graphemes, since that would be incredibly expensive computationally. We have 
enough performance problems with strings as it is. What we'll probably get is 
extra functions to deal with normalization (and probably something to count 
the number of graphemes) and probably a wrapper type that does deal in 
graphemes.

Regardless, you're right about walkLength returning the number of code points 
rather than graphemes, because strings are considered to be ranges of dchar.

- Jonathan M Davis
April 26, 2012
Re: Notice/Warning on narrowStrings .length
On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> "James Miller" <james@aatch.net> wrote in message 
> news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> > I'm writing an introduction/tutorial to using strings in D, paying
> > particular attention to the complexities of UTF-8 and 16. I realised
> > that when you want the number of characters, you normally actually
> > want to use walkLength, not length. Is it reasonable for the
> > compiler to pick this up during semantic analysis and point out this
> > situation?
> >
> > It's just a thought because a lot of the time, using length will get
> > the right answer, but for the wrong reasons, resulting in lurking
> > bugs. You can always cast to immutable(ubyte)[] or
> > immutable(short)[] if you want to work with the actual bytes anyway.
> 
> I find that most of the time I actually *do* want to use length. Don't
> know if that's common, though, or if it's just a reflection of my
> particular use-cases.
> 
> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
> return the number of "characters" (ie, graphemes), but merely the
> number of code points - which is not the same thing (due to existence
> of the [confusingly-named] "combining characters").
[...]

And don't forget that some code points (notably from the CJK block) are
specified as "double-width", so if you're trying to do text layout,
you'll want yet a different length (layoutLength?).

So we really need all four lengths. Ain't unicode fun?! :-)

Array length is simple.  walkLength is already implemented. Grapheme
length requires recognition of 'combining characters' (or rather,
ignoring said characters), and layout length requires recognizing
widthless, single- and double-width characters.

I've been thinking about unicode processing recently. Traditionally, we
have to decode narrow strings into UTF-32 (aka dchar) then do table
lookups and such. But unicode encoding and properties, etc., are static
information (at least within a single unicode release). So why bother
with hardcoding tables and stuff at all?

What we *really* should be doing, esp. for commonly-used functions like
computing various lengths, is to automatically process said tables and
encode the computation in finite-state machines that can then be
optimized at the FSM level (there are known algos for generating optimal
FSMs), codegen'd, and then optimized again at the assembly level by the
compiler. These FSMs will operate at the native narrow string char type
level, so that there will be no need for explicit decoding.

The generation algo can then be run just once per unicode release, and
everything will Just Work.
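
As a much-simplified stand-in for the generated state machines described above, the sketch below counts code points directly on UTF-8 code units, with no decoding to dchar. It is hand-written rather than generated from Unicode tables, the name countCodePoints is hypothetical, and it assumes the input is valid UTF-8.

```d
import std.range : walkLength;

// Hypothetical helper: counts code points by looking only at code units.
// Every code point has exactly one non-continuation byte, so counting
// the bytes that are NOT of the form 10xxxxxx gives the code point count.
// Assumes valid UTF-8; malformed input is not detected.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (char c; s) // iterates code units, no decoding
    {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

void main()
{
    assert(countCodePoints("") == 0);
    assert(countCodePoints("hello") == 5);
    assert(countCodePoints("h\u00E9llo") == 5);
    assert(countCodePoints("\U0001F600") == 1);

    // Agrees with the decoding-based count on valid input.
    string text = "noe\u0308l";
    assert(countCodePoints(text) == text.walkLength);
}
```

Grapheme and layout lengths would need real Unicode property data on top of this, which is where a generated machine would earn its keep.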


T

-- 
Give me some fresh salted fish, please.
April 26, 2012
Re: Notice/Warning on narrowStrings .length
"Jonathan M Davis" <jmdavisProg@gmx.com> wrote in message 
news:mailman.2166.1335463456.4860.digitalmars-d@puremagic.com...
> On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not* 
>> return
>> the number of "characters" (ie, graphemes), but merely the number of code
>> points - which is not the same thing (due to existence of the
>> [confusingly-named] "combining characters").
>
> You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's
> internals) deals with graphemes. It all operates on code points, and 
> strings
> are considered to be ranges of code points, not graphemes. So, as far as
> ranges go, walkLength returns the actual length of the range. That's 
> _usually_
> the number of characters/graphemes as well, but it's certainly not 100%
> correct. We'll need further unicode facilities in Phobos to deal with that
> though, and I doubt that strings will ever change to be treated as ranges 
> of
> graphemes, since that would be incredibly expensive computationally. We 
> have
> enough performance problems with strings as it is. What we'll probably get 
> is
> extra functions to deal with normalization (and probably something to 
> count
> the number of graphemes) and probably a wrapper type that does deal in
> graphemes.
>

Yea, I'm not saying that walkLength should deal with graphemes. Just that if 
someone wants the number of "characters", then neither length *nor* 
walkLength are guaranteed to be correct.