August 19, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sean Kelly | Which of course wouldn't really work if you had invisible characters like the soft hyphen or the zero-width space, or if you had combining marks in a decomposed form.
On 2011-08-19 at 10:36, Sean Kelly wrote:
> The few times I used it were for trimming a buffer to some length for display purposes.
>
> Sent from my iPhone
>
> On Aug 19, 2011, at 5:41 AM, Walter Bright <walter at digitalmars.com> wrote:
>
>> Sean Kelly wrote:
>>>
>>> I need to do this from time to time, but I generally just do something like:
>>>
>>> buf[0 .. buf.toUCSindex(n)]
>>>
>>> A shorthand might be nice though, I suppose.
>>
>> Somewhat surprisingly, such a function is rarely needed (I've never needed it in working with UTF8)
>> and so I don't think a special syntax for it is justified.
>> _______________________________________________
>> phobos mailing list
>> phobos at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/phobos
-- Michel Fortin michel.fortin at michelf.com http://michelf.com/
|
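The objection about invisible characters and decomposed combining marks can be checked with a short sketch. It assumes a hypothetical helper, `firstCodePoints`, that is not part of Phobos, and it uses `std.utf.toUTFindex`, which in current Phobos maps a code point count to a code unit index. The helper also assumes that `n` does not exceed the number of code points in the string.

```d
import std.range : walkLength;
import std.utf : toUTFindex;

/// Hypothetical helper: the first `n` code points of `s`, as a slice of `s`.
/// Assumes `n` is no larger than the number of code points in `s`.
string firstCodePoints(string s, size_t n)
{
    return s[0 .. s.toUTFindex(n)];
}

unittest
{
    // "café" with the accent stored as 'e' followed by U+0301 (combining acute).
    string decomposed = "cafe\u0301";

    assert(decomposed.length == 6);     // UTF-8 code units
    assert(decomposed.walkLength == 5); // code points
    // On screen this is four characters, yet a four-code-point trim loses the accent.
    assert(firstCodePoints(decomposed, 4) == "cafe");

    // A soft hyphen (U+00AD) is invisible but still counts as a code point.
    string hyphenated = "co\u00ADop";
    assert(hyphenated.walkLength == 5);
    assert(firstCodePoints(hyphenated, 3) == "co\u00AD");
}
```

The block has no `main`; something like `dmd -unittest -main -run example.d` should run the assertions.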
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | On Sat, 20 Aug 2011 02:24:41 +0900, unDEFER <undefer at gmail.com> wrote:
[snip]
> The fact that the following code
> ----
> writeln( arr.length );
> arr.popFront();
> writeln( arr.length );
> ----
> prints 9 after 10 for any array, but for UTF-8 and UTF-16 strings may just as well print 8 or less, seems too confusing to me.
You can use std.algorithm.count to count the number of elements.
assert([1,2,3].count() == 3);
assert("abc".count() == 3);
assert("???".count() == 3);
|
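The behaviour being discussed is easy to reproduce. The following is a minimal sketch, assuming a current Phobos in which `popFront` on `string` and `wstring` decodes one whole code point; `unitsDropped` is a hypothetical helper introduced only for illustration.

```d
import std.range : popFront;

/// Hypothetical helper: how many array elements a single popFront removes.
size_t unitsDropped(T)(T[] arr)
{
    immutable before = arr.length;
    arr.popFront();
    return before - arr.length;
}

unittest
{
    assert(unitsDropped([1, 2, 3]) == 1);        // ordinary array: always one element
    assert(unitsDropped("abc") == 1);            // ASCII: one UTF-8 code unit
    assert(unitsDropped("\u00E9t\u00E9") == 2);  // 'é' occupies two code units
    assert(unitsDropped("\u20AC5") == 3);        // '€' occupies three
    assert(unitsDropped("\U0001F600"w) == 2);    // outside the BMP: a UTF-16 surrogate pair
    assert(unitsDropped("\u00E9t\u00E9"d) == 1); // dstring: one dchar per code point
}
```

So `arr.length` reports code units, while `popFront` advances by code points, which is why the two numbers in the quoted snippet can differ by more than one.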
August 19, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | On Friday, August 19, 2011 03:07 unDEFER wrote:
> On Fri, 19 Aug 2011 06:53:37 +0400, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > Hmmm. Such a function isn't entirely a bad idea, but it also makes me a bit nervous. Slicing is efficient. The slice function that you suggest is not. I mean, it's efficient enough for what it's doing, but it's not O(1) like slicing is, so having a slice function could be a bit misleading.
>
> I know that it is not efficient, but that just raises the question of why D decided not to support 8-bit encodings. Only they make operations like this efficient.
>
> > Once drop has been merged in, you'll be able to do this
> > auto s = takeExactly(drop(str, firstIndex), lastIndex - firstIndex);
> > to get the same effect. It may be worth adding such a function though.
>
> I'm sorry, but it looks like there is no "drop()" function.
> Anyway, thank you. I really don't understand how takeExactly works, but it works. For newbies it is really not obvious that std.range works fine with UTF-8 strings.
I said "once drop has been merged in, you'll be able to..." It's not in yet. There's a pull request for it (which was merged in this morning actually), and it's going to be in before the next release, but it's not in yet.
std.range most definitely works with UTF-8 strings. _All_ strings are considered ranges of dchar. And as ranges, strings of char and wchar are not considered sliceable or random access, and they have no length property (as none of that works when multiple elements in the array make up a single element in the range).
std.range.take creates a range with up to n elements of the range that it's given. It's not the same type as the original range, since it's lazy and takes elements from the original range only as you iterate it (it takes fewer than n elements if the range holds fewer than n elements; otherwise it takes n elements).
std.range.takeExactly takes exactly n elements from the range, and if the range defines a length property, then it returns the exact same type. I was thinking that it managed to return the exact same type for strings as well, in spite of the fact that it has no length property, but it does not appear that it does. So, if you need the type to be string specifically, as opposed to a generic range of dchar, then takeExactly isn't going to work. You could call std.array.array on it to get a string again, but that's creating a new string, which obviously isn't as efficient.
I would point out though that what's generally done when someone needs random access to a string is to use dstring. So, if you're really looking to take slices out of the middle of a string like that, it's better to just use dstring. It _is_ sliceable and has a length property, because each element in an array of dchar is a dchar, unlike arrays of char and wchar, where multiple elements are required to make a dchar.
> > Certainly
> > auto s = slice(firstIndex, lastIndex);
> > is cleaner. If we add it though, then we should probably give it a
> > different name. Maybe sliceByElementType? That does seem a bit long
> > though, if accurate.
That would make sense if we restricted it to strings, but if we added the function, it would be useful for any range which didn't define a length property, so we wouldn't be making it string-specific, and so subString wouldn't make any sense as a function name. Though, come to think of it, for any type of range other than an array of char or wchar, such a function would not be able to return the original type, so its value is certainly less in the general case.
Regardless, given the inefficiencies involved, I think that we should be discouraging taking random slices of strings or wstrings. There's no reason to make it so that you can't do it, but including a function in Phobos to do it makes it overly easy IMHO.
Someone who needs to be taking slices from the middle of strings like that really should be using dstrings in most cases. If it's a bit ugly to slice the middle of a string, that's probably a good thing.
As Sean pointed out, std.utf.toUCSindex (which should probably be renamed to toUCSIndex to be properly camelcased, but I don't know if we'll fix that or not) will give you the index into the string that you need.
auto firstIndex = str.toUCSindex(7);
auto lastIndex = firstIndex + str[firstIndex .. $].toUCSindex(8);
auto slice = str[firstIndex .. lastIndex];
should give you the equivalent of str[7 .. 15] if str were a dstring. You could also do it as
auto slice = str[str.toUCSindex(7) .. str.toUCSindex(15)];
which would be clearer, but it would also be less efficient.
So, we _might_ add a slicing function to Phobos, but I'm skeptical of the wisdom of making it that easy to slice a string or wstring like that given how inefficient it is. std.utf already makes it possible in as efficient a manner as is possible - just not in as concise a way - and if you're really taking slices out of the middle of a string, you really should be doing it with dstrings. It's far more efficient that way.
- Jonathan M Davis
|
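The index arithmetic described in this post can be wrapped in a small helper. What follows is a sketch under stated assumptions, not a Phobos function: `sliceByCodePoint` is a hypothetical name, and it relies on `std.utf.toUTFindex`, which in current Phobos is the function that maps a code point count to a code unit index (`toUCSindex` performs the reverse mapping, from a code unit index to a code point count). It also assumes `first <= last` and that `last` does not exceed the number of code points in the string.

```d
import std.algorithm : equal;
import std.range : drop, takeExactly;
import std.utf : toUTFindex;

/// Hypothetical helper: slice `str` by code point positions [first, last).
/// O(n) in the length of `str`; the result shares memory with `str`.
/// Assumes first <= last <= number of code points in `str`.
string sliceByCodePoint(string str, size_t first, size_t last)
{
    immutable start = str.toUTFindex(first);
    // Resume counting from `start` so the string is only walked once.
    immutable end = start + str[start .. $].toUTFindex(last - first);
    return str[start .. end];
}

unittest
{
    string s = "h\u00E9llo w\u00F6rld"; // "héllo wörld": 11 code points, 13 code units

    assert(sliceByCodePoint(s, 1, 4) == "\u00E9ll");
    assert(sliceByCodePoint(s, 6, 11) == "w\u00F6rld");

    // The lazy alternative yields a range of dchar rather than a string.
    assert(s.drop(1).takeExactly(3).equal("\u00E9ll"));
}
```

Returning a slice keeps the result a `string` without copying, while the `drop`/`takeExactly` form stays generic at the cost of losing the string type.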
August 19, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Masahiro Nakagawa | On Friday, August 19, 2011 11:42 Masahiro Nakagawa wrote:
> On Sat, 20 Aug 2011 02:24:41 +0900, unDEFER <undefer at gmail.com> wrote:
>
> [snip]
>
> > The fact that the following code
> > ----
> > writeln( arr.length );
> > arr.popFront();
> > writeln( arr.length );
> > ----
> > prints 9 after 10 for any array, but for UTF-8 and UTF-16 strings may just as well print 8 or less, seems too confusing to me.
>
> You can use std.algorithm.count to count the number of elements.
>
> assert([1,2,3].count() == 3);
> assert("abc".count() == 3);
> assert("???".count() == 3);
The correct function for getting the number of elements for a range is std.range.walkLength. count will call its predicate (which defaults to "true") on every member of the range. walkLength, on the other hand, will call the range's length property if it has one (string and wstring don't have a length property as far as ranges are concerned, because they're ranges of dchar, not char or wchar) and simply iterates through the range, counting its elements otherwise. So, it will be more efficient to call walkLength, and that's what it's for. count is for counting the number of elements in the range which match its predicate, not for counting the number of elements in the range.
- Jonathan M Davis
|
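The difference between the two functions can be illustrated as follows. This is a small sketch assuming a current Phobos; the sample string and the predicate are made up for the example.

```d
import std.algorithm : count;
import std.range : walkLength;

unittest
{
    string s = "na\u00EFve"; // "naïve": 5 code points, 6 UTF-8 code units

    assert(s.length == 6);     // array length: code units
    assert(s.walkLength == 5); // range length: code points, found by iterating
    assert(s.count() == 5);    // same number, reached by counting every element

    // count is meant for counting matches.
    assert(s.count('a') == 1);
    assert(s.count!(c => c > 0x7F)() == 1);

    // For a range that does have a length, walkLength simply returns it.
    int[] arr = [1, 2, 3];
    assert(arr.walkLength == 3);
}
```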
August 19, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER |
unDEFER wrote:
>
> The fact that the following code
> ----
> writeln( arr.length );
> arr.popFront();
> writeln( arr.length );
> ----
> prints 9 after 10 for any array, but for UTF-8 and UTF-16 strings may just as well print 8 or less, seems too confusing to me.
>
There isn't any getting away from understanding that UTF-8 is a multi-byte encoding. If you want to use an encoding with a 1:1 correspondence between indices and characters, use dchar encoding.
|
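The 1:1 correspondence offered by dchar can be seen in a short sketch. It assumes `std.conv.to` for the conversion; the sample text is invented for the example.

```d
import std.conv : to;

unittest
{
    string  utf8  = "gr\u00FC\u00DFe"; // "grüße"
    dstring utf32 = utf8.to!dstring;

    assert(utf8.length == 7);  // 'ü' and 'ß' need two code units each
    assert(utf32.length == 5); // one dchar per code point

    assert(utf32[2] == '\u00FC');             // O(1) indexing by code point
    assert(utf32[2 .. 4] == "\u00FC\u00DF"d); // O(1) slicing by code point

    // The price is memory: four bytes per code point.
    assert(utf32.length * dchar.sizeof == 20);
}
```

The conversion itself is a linear pass that allocates a new array, so it pays off mainly when many index or slice operations follow.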
August 19, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | On 08/18/2011 02:21 AM, unDEFER wrote:
> Hello!
>
> The D language specification says that it supports UTF-8 strings, but I can't find how to slice a UTF-8 string by character index rather than by byte offset. Why is there no simple slice function in std.utf like the attached code?
BTW: your code is flawed. Feed it some of the stuff near the end of this post and it will fail:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
tl;dr; your code doesn't slice on characters but on something called (IIRC) code points. If you start worrying about diacritics (and many end users will want you to) you need to do a bunch more processing.
http://en.wikipedia.org/wiki/Diacritic
> Thank you in advance.
|
August 19, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Benjamin Shropshire | On Friday, August 19, 2011 19:58:34 Benjamin Shropshire wrote:
> On 08/18/2011 02:21 AM, unDEFER wrote:
> > Hello!
> >
> > The D language specification says that it supports UTF-8 strings, but I
> > can't find how to slice a UTF-8 string by character index rather than by
> > byte offset. Why is there no simple slice function in std.utf like the
> > attached code?
>
> BTW: your code is flawed. Feed it some of the stuff near the end of this post and it will fail:
>
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>
> tl;dr; your code doesn't slice on characters but on something called (IIRC)
> code points. If you start worrying about diacritics (and many end users
> will want you to)
> you need to do a bunch more processing.
>
> http://en.wikipedia.org/wiki/Diacritic
His code works as well as slicing a dstring does - save for the efficiency issues. There is no way in Phobos at present to deal with graphemes. All of the string processing in Phobos deals with code points. For the most part, this works great, but it is true that it isn't complete. I expect that we'll get grapheme support eventually (I believe that Dmitry has done some work on a grapheme range for the updates that he's been doing to std.regex for GSoC, so we may get it from there). But for now, none of the string processing in D worries about graphemes - just code points.
- Jonathan M Davis
|
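The three levels mentioned here (code units, code points, graphemes) can be compared side by side. The sketch below assumes a Phobos release that ships `std.uni.byGrapheme`, which did not exist when this thread was written; `graphemeCount` is a hypothetical helper added for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: number of user-perceived characters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    // 'e' + combining acute accent, then 'a' + combining ring above.
    string s = "e\u0301a\u030A";

    assert(s.length == 6);         // UTF-8 code units
    assert(s.walkLength == 4);     // code points
    assert(graphemeCount(s) == 2); // what a reader would call characters

    // For plain ASCII all three counts agree.
    assert(graphemeCount("abc") == 3);
}
```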
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On Sat, 20 Aug 2011 05:22:38 +0900, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> On Friday, August 19, 2011 11:42 Masahiro Nakagawa wrote:
>> On Sat, 20 Aug 2011 02:24:41 +0900, unDEFER <undefer at gmail.com> wrote:
>>
>> [snip]
>>
>> > The fact that the following code
>> > ----
>> > writeln( arr.length );
>> > arr.popFront();
>> > writeln( arr.length );
>> > ----
>> > prints 9 after 10 for any array, but for UTF-8 and UTF-16 strings may just as well print 8 or less, seems too confusing to me.
>>
>> You can use std.algorithm.count to count the number of elements.
>>
>> assert([1,2,3].count() == 3);
>> assert("abc".count() == 3);
>> assert("???".count() == 3);
>
> The correct function for getting the number of elements for a range is
> std.range.walkLength. count will call its predicate (which defaults to
> "true")
> on every member of the range. walkLength, on the other hand, will call
> the
> range's length property if it has one (string and wstring don't have a
> length
> property as far as ranges are concerned, because they're ranges of
> dchar, not
> char or wchar) and simply iterates through the range, counting its
> elements
> otherwise. So, it will be more efficient to call walkLength, and that's
> what
> it's for. count is for counting the number of elements in the range which
> match its predicate, not for counting the number of elements in the
> range.
Yes, I know.
|
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Masahiro Nakagawa | On Saturday, August 20, 2011 15:50:37 Masahiro Nakagawa wrote:
> On Sat, 20 Aug 2011 05:22:38 +0900, Jonathan M Davis <jmdavisProg at gmx.com>
>
> wrote:
> > On Friday, August 19, 2011 11:42 Masahiro Nakagawa wrote:
> >> On Sat, 20 Aug 2011 02:24:41 +0900, unDEFER <undefer at gmail.com> wrote:
> >>
> >> [snip]
> >>
> >> > The fact that the following code
> >> > ----
> >> > writeln( arr.length );
> >> > arr.popFront();
> >> > writeln( arr.length );
> >> > ----
> >> > prints 9 after 10 for any array, but for UTF-8 and UTF-16 strings
> >> > may just as well print 8 or less, seems too confusing to me.
> >>
> >> You can use std.algorithm.count to count the number of elements.
> >>
> >> assert([1,2,3].count() == 3);
> >> assert("abc".count() == 3);
> >> assert("???".count() == 3);
> >
> > The correct function for getting the number of elements for a range is
> > std.range.walkLength. count will call its predicate (which defaults to
> > "true")
> > on every member of the range. walkLength, on the other hand, will call
> > the
> > range's length property if it has one (string and wstring don't have a
> > length
> > property as far as ranges are concerned, because they're ranges of
> > dchar, not
> > char or wchar) and simply iterates through the range, counting its
> > elements
> > otherwise. So, it will be more efficient to call walkLength, and that's
> > what
> > it's for. count is for counting the number of elements in the range
> > which
> > match its predicate, not for counting the number of elements in the
> > range.
>
> Yes, I know.
Then I don't understand why you were suggesting that he use count rather than walkLength.
- Jonathan M Davis
|
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | Big thanks, Jonathan! You gave me very clear explanations.
But what do you mean by "strings of char and wchar ... have no length property" if "string.length" really works? Is it a bug?
On Sat, 20 Aug 2011 00:22:24 +0400, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> And as ranges, strings of char and wchar are not
> considered sliceable or random access, and they have no length property
> (as none of that works when multiple elements in the array make up a
> single element in the range).
-- Nikolay Krivchenkov aka unDEFER
I want to believe... in unDE.su
registered Linux user #360474
Don't worry, I can read OpenOffice.org/Libre Office/Lotus Symphony documents
|
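One way to see what "no length property as far as ranges are concerned" means is to ask the range traits directly. The sketch below assumes a current Phobos, where strings of char and wchar are still treated as ranges of dchar; the array property `.length` exists in every case, and only the traits consulted by generic range code differ.

```d
import std.range : ElementType, hasLength, hasSlicing, isRandomAccessRange, walkLength;

// The built-in array property is always available and counts code units.
static assert(is(typeof("abc".length) == size_t));

// The range traits consulted by generic code say otherwise for string and wstring.
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!wstring);
static assert(is(ElementType!string == dchar));

// A dstring passes all of them, because one array element is one code point.
static assert(hasLength!dstring);
static assert(hasSlicing!dstring);
static assert(isRandomAccessRange!dstring);

unittest
{
    string s = "\u00E9t\u00E9"; // "été"
    assert(s.length == 5);      // array view: UTF-8 code units
    assert(s.walkLength == 3);  // range view: code points
}
```

So `string.length` working is not a bug: it is the array's length, which range-based algorithms deliberately ignore for char and wchar strings.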
Copyright © 1999-2021 by the D Language Foundation