October 27, 2013 Re: How to get a substring?
Posted in reply to Timothee Cour | On Sunday, 27 October 2013 at 03:45:50 UTC, Timothee Cour wrote:
> On Sat, Oct 26, 2013 at 6:24 PM, Nicolas Sicard <dransic@gmail.com> wrote:
>
>> On Sunday, 27 October 2013 at 00:18:41 UTC, Timothee Cour wrote:
>>
>>> I've posted a while back a string=>string substring function that doesn't
>>> allocate: google
>>> "nonallocating unicode string manipulations"
>>>
>>> code:
>>>
>>> auto slice(T)(T a, size_t u, size_t v) if (is(T == string)) { // TODO: generalize to isSomeString
>>>     import std.exception;
>>>     auto m = a.length;
>>>     size_t i;
>>>     enforce(u <= v);
>>>     import std.utf;
>>>     while (u-- && i < m) {
>>>         auto si = stride(a, i);
>>>         i += si;
>>>         v--;
>>>     }
>>>     // assert(u==-1);
>>>     // enforce(u==-1);
>>>     size_t i2 = i;
>>>     while (v-- && i2 < m) {
>>>         auto si = stride(a, i2);
>>>         i2 += si;
>>>     }
>>>     // assert(v==-1);
>>>     enforce(v == -1);
>>>     return a[i .. i2];
>>> }
>>> unittest {
>>>     import std.range;
>>>     auto a = "≈açç√ef";
>>>     auto b = a.slice(2, 6);
>>>     assert(a.slice(2, 6) == "çç√e");
>>>     assert(a.slice(2, 6).ptr == a.slice(2, 3).ptr);
>>>     assert(a.slice(0, a.walkLength) is a);
>>>     import std.exception;
>>>     assertThrown(a.slice(2, 8));
>>>     assertThrown(a.slice(2, 1));
>>> }
>>>
>>>
>> Another one, with negative index like Javascript's String.slice():
>> http://dpaste.dzfl.pl/608435c5
>>
>
> not as efficient as what I proposed since it's iterating over the string
> twice (the 2nd index redoes the work done by 1st index). Could be adapted
> though.
Yes. slice was a quick addition to before/after.
I also wonder if std.uni.graphemeStride would be more appropriate.
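To illustrate the difference that remark points at, here is a minimal sketch using `std.uni.graphemeStride` and `std.uni.byGrapheme` from current Phobos. The helper name `firstGrapheme` is made up for this example and is not part of the thread's code.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

// Hypothetical helper: the first grapheme cluster of s, returned as a
// slice of the original string (no allocation).
string firstGrapheme(string s)
{
    if (s.length == 0)
        return s;
    // graphemeStride measures the cluster starting at index 0 in code units.
    return s[0 .. graphemeStride(s, 0)];
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x'.
    string s = "e\u0301x";
    assert(s.length == 4);                 // UTF-8 code units: 1 + 2 + 1
    assert(s.walkLength == 3);             // code points
    assert(s.byGrapheme.walkLength == 2);  // grapheme clusters
    assert(firstGrapheme(s) == "e\u0301"); // the accent stays attached
}
```

A code-point-based `slice(0, 1)` would return only the bare `e` here and leave the combining accent behind.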
October 27, 2013 Re: How to get a substring?
Posted in reply to Gautam Goel | On Saturday, 26 October 2013 at 21:23:13 UTC, Gautam Goel wrote:
> Dumb Newbie Question: I've searched through the library reference, but I haven't figured out how to extract a substring from a string. I'd like something like string.substring("Hello", 0, 2) to return "Hel", for example. What method am I looking for? Thanks!
There are a lot of good answers in this thread but I also think they miss the real issue here.
When working with Unicode, you'll want to stop thinking in terms of indices, in order to produce correct code. Getting a sub-string by passing indices is a means to an end; you'll want to replace the index paradigm with an approach that does not rely on indices.
Working with indices where the smallest unit is a code point (dchar), which has been suggested in this thread, is still not good enough because you'll either a) potentially break up grapheme clusters, which can have disastrous results, or b) end up redundantly searching through the string to find the correct code point positions.
With Phobos, you can use algorithms such as `find` and the `findSplit` family of algorithms to do string manipulation without using indices, and the cool thing about this approach is that as long as the input strings are properly formed UTF and have intact grapheme clusters, it's impossible to get Unicode-incorrect results!
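As a rough sketch of that index-free style, the following uses `findSplit` from `std.algorithm`. The helper `after` is a hypothetical name chosen for illustration, loosely echoing the before/after functions mentioned earlier in the thread.

```d
import std.algorithm : findSplit;

// Hypothetical helper: everything after the first occurrence of sep,
// or an empty string when sep does not occur in s.
string after(string s, string sep)
{
    auto parts = s.findSplit(sep);
    return parts[2];
}

void main()
{
    string s = "naïve≈café";
    auto parts = s.findSplit("≈");
    assert(parts[0] == "naïve"); // text before the separator
    assert(parts[1] == "≈");     // the separator itself
    assert(parts[2] == "café");  // text after the separator
    // The pieces are slices of the original string, not copies.
    assert(parts[0].ptr == s.ptr);
    assert(after(s, "≈") == "café");
}
```

No index appears in the calling code, so there is no opportunity to cut a multi-byte sequence in half.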
When working with ASCII, just slice.
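For the original question, and assuming the input really is ASCII-only, a minimal sketch looks like this. Note that D slice bounds are half-open, so the end index is one past the last character wanted.

```d
void main()
{
    string s = "Hello";
    string sub = s[0 .. 3];      // characters at indices 0, 1 and 2
    assert(sub == "Hel");
    assert(sub.ptr == s.ptr);    // a view into s, not a copy
    assert(s[1 .. $] == "ello"); // $ is the length of s
}
```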
October 27, 2013 Re: How to get a substring?
Posted in reply to Jakob Ovrum | On Sunday, 27 October 2013 at 07:44:06 UTC, Jakob Ovrum wrote:
> On Saturday, 26 October 2013 at 21:23:13 UTC, Gautam Goel wrote:
>> Dumb Newbie Question: I've searched through the library reference, but I haven't figured out how to extract a substring from a string. I'd like something like string.substring("Hello", 0, 2) to return "Hel", for example. What method am I looking for? Thanks!
>
> There are a lot of good answers in this thread but I also think they miss the real issue here.
>
I don't think so. It's indeed worth noticing that Phobos' algorithms work with Unicode nicely, but:
a) working on indices is sometimes the actual functionality you need
b) you need to allocate a new string from the range they return (the slice functions in this thread don't)
c) do they really handle grapheme clusters? (I don't know)
October 27, 2013 Re: How to get a substring?
Posted in reply to Nicolas Sicard | On Sunday, 27 October 2013 at 08:14:30 UTC, Nicolas Sicard wrote:
> I don't think so. It's indeed worth noticing that Phobos' algorithms work with Unicode nicely, but:
> a) working on indices is sometimes the actual functionality you need
It is a means to an end. I'm saying it can be replaced with a much superior approach.
> b) you need to allocate a new string from the range they return (the slice functions in this thread don't)
That's only the case with strictly lazy algorithms. The algorithms I pointed out are eager.
> c) do they really handle grapheme clusters? (I don't know)
Rendering code and such is just about the only domain that needs to parse grapheme clusters. However, that doesn't mean that naive string manipulation code that uses indices can't break the string horribly before sending it off to rendering.
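The distinction between slice-returning and lazy algorithms can be sketched as follows. The helper names `tailFrom` and `withoutSpaces` are invented for this example; `find` and `filter` are the real `std.algorithm` functions.

```d
import std.algorithm : filter, find;
import std.array : array;
import std.conv : to;

// Hypothetical helper: find on a string returns a slice of that string,
// so nothing is allocated.
string tailFrom(string s, string needle)
{
    return s.find(needle);
}

// Hypothetical helper: filter returns a lazy wrapper range, so getting a
// string back requires materializing it, which allocates.
string withoutSpaces(string s)
{
    return s.filter!(c => c != ' ').array.to!string;
}

void main()
{
    string s = "naïve café";
    auto tail = tailFrom(s, "café");
    assert(tail == "café");
    // The result points into the original string's memory.
    assert(tail.ptr == s.ptr + (s.length - tail.length));
    assert(withoutSpaces(s) == "naïvecafé");
}
```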
October 27, 2013 Re: How to get a substring?
Posted in reply to Nicolas Sicard | On Sunday, October 27, 2013 09:14:28 Nicolas Sicard wrote:
> On Sunday, 27 October 2013 at 07:44:06 UTC, Jakob Ovrum wrote:
> > On Saturday, 26 October 2013 at 21:23:13 UTC, Gautam Goel wrote:
> >> Dumb Newbie Question: I've searched through the library reference, but I haven't figured out how to extract a substring from a string. I'd like something like string.substring("Hello", 0, 2) to return "Hel", for example. What method am I looking for? Thanks!
> >
> > There are a lot of good answers in this thread but I also think they miss the real issue here.
>
> I don't think so. It's indeed worth noticing that Phobos' algorithms work with Unicode nicely, but:
> a) working on indices is sometimes the actual functionality you need
Sometimes, but it usually isn't. If you find that you frequently need to use indices for a string, then you should probably rethink how you're using strings. Phobos aims at operating on ranges, which rarely means using indices, and _very_ rarely means using indices on strings. In general, indices only get used on strings when you're trying to optimize a particular algorithm for strings and make sure that you slice the string so that the result is a string rather than a wrapper range.
Sure, indexing strings can be very useful, but the way that Phobos is designed does not lend itself to using string indices (quite the opposite in fact), and in my experience, using string indices is rarely needed even when doing heavy string manipulation.
> b) you need to allocate a new string from the range they return (the slice functions in this thread don't)
You _rarely_ want to do that. Allocating a new string is just plain wasteful in most cases. The fact that the elements in a string are immutable makes it so that you can slice without worrying about allocating new strings. You should pretty much only be allocating new strings when slicing when the original was something like char[] rather than string. The main place where that's usually forced is when reading from a file (since buffers are frequently reused when reading files and therefore not immutable).
> c) do they really handle grapheme clusters? (I don't know)
I believe that that sort of thing is properly supported by the updated std.uni in 2.064, but it is the sort of thing that you have to code for. Phobos as a whole operates on ranges of dchar - which is correct most of the time but not enough when you need full-on grapheme support. I haven't yet looked in detail at what std.uni now provides though. I just know that it's added some grapheme support.
- Jonathan M Davis
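A small sketch of the point about immutability: slicing a `string` shares memory safely, while a slice of a mutable `char[]` buffer needs `.idup` if it has to outlive later writes to that buffer. The helper `keepPrefix` is a made-up name, and the buffer here only stands in for something like a reused file-reading buffer.

```d
// Hypothetical helper: take an immutable copy of the first n code units
// of a mutable buffer so that later writes to the buffer cannot affect it.
string keepPrefix(const(char)[] buf, size_t n)
{
    return buf[0 .. n].idup;
}

void main()
{
    // Slicing an immutable string: no copy is needed.
    string s = "Hello, world";
    string hello = s[0 .. 5];
    assert(hello.ptr == s.ptr);

    // Slicing a mutable buffer: copy what must be kept.
    char[] buf = "Hello, world".dup;
    string kept = keepPrefix(buf, 5);
    buf[0] = 'J';             // simulate the buffer being reused
    assert(buf[0 .. 5] == "Jello");
    assert(kept == "Hello");  // the copy is unaffected
}
```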
October 27, 2013 Re: How to get a substring?
Posted in reply to Jonathan M Davis | On Sunday, 27 October 2013 at 08:35:11 UTC, Jonathan M Davis wrote:
> On Sunday, October 27, 2013 09:14:28 Nicolas Sicard wrote:
>> On Sunday, 27 October 2013 at 07:44:06 UTC, Jakob Ovrum wrote:
>> > On Saturday, 26 October 2013 at 21:23:13 UTC, Gautam Goel wrote:
>> >> Dumb Newbie Question: I've searched through the library reference, but I haven't figured out how to extract a substring from a string. I'd like something like string.substring("Hello", 0, 2) to return "Hel", for example. What method am I looking for? Thanks!
>> >
>> > There are a lot of good answers in this thread but I also think they miss the real issue here.
>>
>> I don't think so. It's indeed worth noticing that Phobos' algorithms work with Unicode nicely, but:
>> a) working on indices is sometimes the actual functionality you need
>
> Sometimes, but it usually isn't. If you find that you frequently need to use indices for a string, then you should probably rethink how you're using strings. Phobos aims at operating on ranges, which rarely means using indices, and _very_ rarely means using indices on strings. In general, indices only get used on strings when you're trying to optimize a particular algorithm for strings and make sure that you slice the string so that the result is a string rather than a wrapper range.
+1
Also, I think if users need to get UTF indices in their own code, it's indicative of either a) Phobos lacking an (optimized) algorithm, or b) the user is doing something extremely niche that Phobos can't aim to cover generically.
> Sure, indexing strings can be very useful, but the way that Phobos is designed does not lend itself to using string indices (quite the opposite in fact), and in my experience, using string indices is rarely needed even when doing heavy string manipulation.
And Phobos is better off for it!
I don't know if we do a good enough job of educating users about Unicode and its implications though, assuming this is a responsibility of the D community towards new D users.
>> c) do they really handle grapheme clusters? (I don't know)
>
> I believe that that sort of thing is properly supported by the updated std.uni in 2.064, but it is the sort of thing that you have to code for. Phobos as a whole operates on ranges of dchar - which is correct most of the time but not enough when you need full-on grapheme support. I haven't yet looked in detail at what std.uni now provides though. I just know that it's added some grapheme support.
>
> - Jonathan M Davis
The new std.uni supports all the grapheme-related functionality you would ever need (as far as I can tell), but the nice thing is that most code doesn't need to use it to be Unicode-correct. i.e. you don't need to be "aware" of grapheme clusters to not break them in the vast majority of code domains.
October 27, 2013 Re: How to get a substring?
Posted in reply to Jakob Ovrum | On Sunday, 27 October 2013 at 08:53:48 UTC, Jakob Ovrum wrote:
> The new std.uni supports all the grapheme-related functionality you would ever need (as far as I can tell), but the nice thing is that most code doesn't need to use it to be Unicode-correct. i.e. you don't need to be "aware" of grapheme clusters to not break them in the vast majority of code domains.
Actually, I think that normalization of the input strings might be a common requirement for Unicode-correct string manipulation. I guess that falls under grapheme-related functionality.
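As an illustration of why normalization can matter, here is a minimal sketch using `std.uni.normalize` from current Phobos. The helper `sameText` is a hypothetical name, and comparing NFC forms is only one possible policy, assumed here for the example.

```d
import std.uni;

// Hypothetical helper: compare two strings after bringing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "\u00E9";    // precomposed e with acute accent
    string decomposed = "e\u0301"; // 'e' plus U+0301 COMBINING ACUTE ACCENT
    // The code unit sequences differ although both display the same way.
    assert(composed != decomposed);
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
    assert(sameText(composed, decomposed));
}
```

Without normalization, a `find` for the precomposed form would not match text that stores the same character in decomposed form.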