Notice/Warning on narrowStrings .length - D Programming Language Discussion Forum

April 23, 2012

Notice/Warning on narrowStrings .length

Posted by James Miller

I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.
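
A minimal sketch of the difference described above, assuming a current Phobos: `length` reports UTF-8 code units, `walkLength` reports code points, and `std.string.representation` exposes the raw code units without a cast. The helper names `codeUnits` and `codePoints` are made up for illustration.

```d
import std.range : walkLength;
import std.string : representation;

/// Number of UTF-8 code units (bytes) in the string: an O(1) array lookup.
size_t codeUnits(string s) { return s.length; }

/// Number of code points: O(n), since each sequence has to be decoded.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    import std.stdio : writeln;

    string ascii = "hello";
    string accented = "h\u00E9llo"; // U+00E9 is one code point, two code units

    writeln(codeUnits(ascii), " ", codePoints(ascii));       // 5 5
    writeln(codeUnits(accented), " ", codePoints(accented)); // 6 5

    // The raw code units, typed as immutable(ubyte)[] rather than string.
    immutable(ubyte)[] bytes = accented.representation;
    writeln(bytes.length); // 6
}
```

For pure ASCII text the two counts agree, which is why code that uses `length` as a character count can appear correct until non-ASCII input arrives.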

April 23, 2012

Re: Notice/Warning on narrowStrings .length

Posted by Adam D. Ruppe
in reply to James Miller

On Monday, 23 April 2012 at 23:01:59 UTC, James Miller wrote:
> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?


Maybe... but it is important that this works:

string s;

if(s.length)
   do_something(s);

since that's always right and quite common.

April 23, 2012

Re: Notice/Warning on narrowStrings .length

Posted by bearophile
in reply to James Miller

James Miller:

> I realised that when you want the number of characters, you normally actually want to use walkLength, not length.

As with strlen() in C, unfortunately the result of walkLength(somestring) is computed every time you call it, because it doesn't get cached.
A partial improvement for this situation is to ensure that walkLength(somestring) is strongly pure, and that the D compiler is able to move this invariant pure computation out of loops.


> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?

This is not easy to do, because sometimes you want to know the number of code points, and sometimes of code units.
I remember even a proposal to rename the "length" field to another name for narrow strings, to avoid such bugs.

-----------------------

Adam D. Ruppe:

> Maybe... but it is important that this works:
>
> string s;
>
> if(s.length)
>    do_something(s);
>
> since that's always right and quite common.

Better:

if (!s.empty)
    do_something(s);

(or even better, built-in non-nulls, usable for strings too).

Bye,
bearophile
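
A small sketch of why both spellings of the emptiness test are safe, assuming a current Phobos: a string is empty exactly when it has zero code units, so the check does not depend on how many code units each character occupies. `hasContent` is a hypothetical helper name.

```d
import std.range.primitives : empty;

/// True when the string holds at least one code unit (hypothetical helper).
bool hasContent(string s) { return !s.empty; }

void main()
{
    string unset;            // null string, length 0
    string blank = "";       // non-null, still length 0
    string text = "\u00E9";  // one code point stored as two code units

    assert(!hasContent(unset));
    assert(!hasContent(blank));
    assert(hasContent(text));

    // The length-based test and the range-based test always agree.
    assert((text.length != 0) == !text.empty);
    assert((blank.length != 0) == !blank.empty);
}
```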

April 24, 2012

Re: Notice/Warning on narrowStrings .length

Posted by James Miller
in reply to bearophile

On Monday, 23 April 2012 at 23:52:41 UTC, bearophile wrote:
> James Miller:
>
>> I realised that when you want the number of characters, you normally actually want to use walkLength, not length.
>
> As with strlen() in C, unfortunately the result of walkLength(somestring) is computed every time you call it, because it doesn't get cached.
> A partial improvement for this situation is to ensure that walkLength(somestring) is strongly pure, and that the D compiler is able to move this invariant pure computation out of loops.
>
>
>> Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> This is not easy to do, because sometimes you want to know the number of code points, and sometimes of code units.
> I remember even a proposal to rename the "length" field to another name for narrow strings, to avoid such bugs.

I was thinking about that. This is quite a vague suggestion, more just throwing the idea out there and seeing what people think. I am aware of the issue of walkLength being computed every time, rather than being a constant lookup. One option would be to make it only a warning in @safe code, so worst case scenario is that you mark the function as @trusted. I feel this fits in with the idea of @safe quite well, since you have to explicitly tell the compiler that you know what you're doing.

Another option would be to have some sort of general lint tool that picks up on these kinds of potential errors, though that is a lot bigger scope...

--
James Miller

April 24, 2012

Re: Notice/Warning on narrowStrings .length

Posted by bearophile
in reply to James Miller

James Miller:

> Another option would be to have some sort of general lint tool that picks up on these kinds of potential errors, though that is a lot bigger scope...

A lot of people in D.learn don't even use "-wi -property", so go figure how many will use a lint tool :-)

As a first approximation you can rely only on what people see when compiling with "dmd foo.d", that is, the most basic compiler invocation. More serious programmers thankfully activate warnings.

Bye,
bearophile

April 24, 2012

Re: Notice/Warning on narrowStrings .length

Posted by Jonathan M Davis
in reply to James Miller

On Tuesday, April 24, 2012 01:01:57 James Miller wrote:
> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
> 
> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.

At this point, I don't think that it makes any sense to give a warning for this. The compiler can't possibly know whether using length is a good idea or correct in any particular set of code. If we really want to do something to tackle the problem, then we should create a new string type which better solves the issues. There's a _lot_ more to be worried about due to the fact that strings are variable length encoded than just their length.

There has been talk of creating a new string type, and there has been talk of creating the concept of a variable length encoded range which better handles all of this stuff, but no proposal thus far has gotten anywhere.

As for walkLength being O(n) in many cases (as discussed elsewhere in this thread), I don't think that it's that big a deal. If you know what it's doing, you know that it's O(n), and it's simple enough to simply save the result if you need to call it multiple times.

- Jonathan M Davis
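
Saving the result, as suggested above, might look like the sketch below. `countLonger` is a hypothetical helper; the only point is that the reference string's code point count is computed once rather than on every iteration.

```d
import std.range : walkLength;

/// Counts the entries that have more code points than `reference`.
size_t countLonger(string[] words, string reference)
{
    // walkLength decodes the whole string, so compute it once up front.
    immutable limit = reference.walkLength;

    size_t n;
    foreach (w; words)
        if (w.walkLength > limit) // each word still has to be walked once
            ++n;
    return n;
}

void main()
{
    // Code points: 5, 4, 3. UTF-8 code units: 6, 5, 3.
    auto words = ["na\u00EFve", "caf\u00E9", "abc"];

    // Only the first word has more than 4 code points.
    assert(countLonger(words, "four") == 1);
}
```

Comparing with `.length` instead would also count the second word, since it occupies 5 code units.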

April 26, 2012

Re: Notice/Warning on narrowStrings .length

Posted by Nick Sabalausky
in reply to James Miller

"James Miller" <james@aatch.net> wrote in message news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.

I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases.

Also, keep in mind that (unless I'm mistaken) walkLength does *not* return the number of "characters" (ie, graphemes), but merely the number of code points - which is not the same thing (due to existence of the [confusingly-named] "combining characters").
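
The gap between code points and graphemes can be shown with a combining accent. This sketch assumes a Phobos release that ships `std.uni.byGrapheme`, which was added after this thread was written; `graphemeCount` is a hypothetical helper name.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of grapheme clusters (user-perceived characters) in the string.
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
    string s = "e\u0301";

    assert(s.length == 3);        // code units: 1 for 'e', 2 for U+0301
    assert(s.walkLength == 2);    // code points
    assert(graphemeCount(s) == 1); // graphemes
}
```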

April 26, 2012

Re: Notice/Warning on narrowStrings .length

Posted by Jonathan M Davis
in reply to Nick Sabalausky

On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
> the number of "characters" (ie, graphemes), but merely the number of code
> points - which is not the same thing (due to existence of the
> [confusingly-named] "combining characters").

You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's internals) deals with graphemes. It all operates on code points, and strings are considered to be ranges of code points, not graphemes. So, as far as ranges go, walkLength returns the actual length of the range. That's _usually_ the number of characters/graphemes as well, but it's certainly not 100% correct. We'll need further unicode facilities in Phobos to deal with that though, and I doubt that strings will ever change to be treated as ranges of graphemes, since that would be incredibly expensive computationally. We have enough performance problems with strings as it is. What we'll probably get is extra functions to deal with normalization (and probably something to count the number of graphemes) and probably a wrapper type that does deal in graphemes.

Regardless, you're right about walkLength returning the number of code points rather than graphemes, because strings are considered to be ranges of dchar.

- Jonathan M Davis

April 26, 2012

Re: Notice/Warning on narrowStrings .length

Posted by H. S. Teoh
in reply to Nick Sabalausky

On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> "James Miller" <james@aatch.net> wrote in message news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> > I'm writing an introduction/tutorial to using strings in D, paying particular attention to the complexities of UTF-8 and UTF-16. I realised that when you want the number of characters, you normally actually want to use walkLength, not length. Is it reasonable for the compiler to pick this up during semantic analysis and point out this situation?
> >
> > It's just a thought because a lot of the time, using length will get the right answer, but for the wrong reasons, resulting in lurking bugs. You can always cast to immutable(ubyte)[] or immutable(short)[] if you want to work with the actual bytes anyway.
> 
> I find that most of the time I actually *do* want to use length. Don't know if that's common, though, or if it's just a reflection of my particular use-cases.
> 
> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
> return the number of "characters" (ie, graphemes), but merely the
> number of code points - which is not the same thing (due to existence
> of the [confusingly-named] "combining characters").
[...]

And don't forget that some code points (notably from the CJK block) are
specified as "double-width", so if you're trying to do text layout,
you'll want yet a different length (layoutLength?).

So we really need all four lengths. Ain't unicode fun?! :-)

Array length is simple. walkLength is already implemented. Grapheme length requires recognition of 'combining characters' (or rather, ignoring said characters), and layout length requires recognizing widthless, single- and double-width characters.
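
A toy sketch of the layout length idea. `columnWidth` and `layoutLength` are hypothetical, and the two hard-coded ranges are a deliberately incomplete stand-in for the real Unicode width data: only the Combining Diacritical Marks block is treated as widthless and only the CJK Unified Ideographs block as double-width.

```d
/// Toy column width of a single code point (assumption: two ranges only,
/// not the full East Asian Width property).
int columnWidth(dchar c)
{
    if (c >= 0x0300 && c <= 0x036F) return 0; // Combining Diacritical Marks
    if (c >= 0x4E00 && c <= 0x9FFF) return 2; // CJK Unified Ideographs
    return 1;
}

/// Sum of the toy column widths of every code point in the string.
size_t layoutLength(string s)
{
    size_t total;
    foreach (dchar c; s) // iterating as dchar decodes the UTF-8 sequence
        total += columnWidth(c);
    return total;
}

void main()
{
    assert(layoutLength("abc") == 3);
    assert(layoutLength("e\u0301") == 1);      // combining accent adds no width
    assert(layoutLength("\u65E5\u672C") == 4); // two double-width ideographs
}
```

The last string has 6 code units, 2 code points and a layout length of 4, so none of the other lengths would predict its on-screen width.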

I've been thinking about unicode processing recently. Traditionally, we have to decode narrow strings into UTF-32 (aka dchar) then do table lookups and such. But unicode encoding and properties, etc., are static information (at least within a single unicode release). So why bother with hardcoding tables and stuff at all?

What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs), codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding.

The generation algo can then be run just once per unicode release, and everything will Just Work.

T

-- 
Give me some fresh salted fish, please.
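
As a hand-written illustration of working at the code unit level without decoding (not the generated FSMs proposed above), code points in valid UTF-8 can be counted by skipping continuation bytes. `countCodePoints` is a hypothetical name, and the sketch assumes the input is well-formed UTF-8.

```d
import std.string : representation;

/// Counts code points by counting every byte that is not a UTF-8
/// continuation byte (bit pattern 10xxxxxx). Nothing is decoded.
/// Assumes `s` is valid UTF-8.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (ubyte b; s.representation)
        if ((b & 0xC0) != 0x80)
            ++n;
    return n;
}

void main()
{
    import std.range : walkLength;

    string s = "h\u00E9llo \u65E5\u672C";
    // Agrees with the decoding walk on well-formed input.
    assert(countCodePoints(s) == s.walkLength);
}
```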

April 26, 2012

Re: Notice/Warning on narrowStrings .length

Posted by Nick Sabalausky
in reply to Jonathan M Davis

"Jonathan M Davis" <jmdavisProg@gmx.com> wrote in message news:mailman.2166.1335463456.4860.digitalmars-d@puremagic.com...
> On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
>> the number of "characters" (ie, graphemes), but merely the number of code
>> points - which is not the same thing (due to existence of the
>> [confusingly-named] "combining characters").
>
> You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's internals) deals with graphemes. It all operates on code points, and strings are considered to be ranges of code points, not graphemes. So, as far as ranges go, walkLength returns the actual length of the range. That's _usually_ the number of characters/graphemes as well, but it's certainly not 100% correct. We'll need further unicode facilities in Phobos to deal with that though, and I doubt that strings will ever change to be treated as ranges of graphemes, since that would be incredibly expensive computationally. We have enough performance problems with strings as it is. What we'll probably get is extra functions to deal with normalization (and probably something to count the number of graphemes) and probably a wrapper type that does deal in graphemes.
>

Yea, I'm not saying that walkLength should deal with graphemes. Just that if someone wants the number of "characters", then neither length *nor* walkLength is guaranteed to be correct.

Copyright © 1999-2021 by the D Language Foundation