August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Sat, 20 Aug 2011 06:49:33 +0400, Walter Bright <walter at digitalmars.com> wrote:
> There isn't any getting away from understanding that UTF-8 is a multi-byte encoding.
If it is so, then arr.popFront() must break UTF-8 strings ;-)
> If you want to use an encoding with a 1:1 correspondence between indices and characters, use dchar encoding.
For me, using 4 times more memory for ASCII seems too wasteful, sorry.
Walter, I really like your creation very much. It is great. Big thank you for it!
I really believe that there are no bugs, only undocumented features ;-)
I just want to say that the documentation does not give enough information now. The std.range and std.array documentation doesn't say anything about their behaviour on UTF-8 strings.
I already read the source code to find out what a function really does. Open Source is really great :-)
--
Nikolay Krivchenkov aka unDEFER
I want to believe... in unDE.su
registered Linux user #360474
Don't worry, I can read OpenOffice.org/Libre Office/Lotus Symphony documents
|
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | On 20 aug 2011, at 11:38, unDEFER wrote:
> On Sat, 20 Aug 2011 06:49:33 +0400, Walter Bright <walter at digitalmars.com> wrote:
>
>> There isn't any getting away from understanding that UTF-8 is a multi-byte encoding.
>
> If it is so, then arr.popFront() must break UTF-8 strings ;-)
>
>> If you want to use an encoding with a 1:1 correspondence between indices and characters, use dchar encoding.
>
> For me use in 4 times more memory for ASCII seems too wasteful, sorry.
If you know for sure that a string will only contain ASCII, you can use .length and the built-in slicing syntax. It doesn't matter which Unicode encoding you use: a non-ASCII character will take up more than one byte.
--
/Jacob Carlborg
|
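The ASCII-only case described above can be sketched as follows. This is a minimal illustration, not code from the thread: `isAllAscii` is a hypothetical helper name, and the sample strings are made up for the example.

```d
import std.stdio : writeln;

/// Hypothetical helper: true when every code unit is below 0x80, i.e. the
/// string is pure ASCII and one char corresponds to exactly one character.
bool isAllAscii(const(char)[] s)
{
    foreach (char c; s)      // iterating as char visits code units, no decoding
        if (c >= 0x80)
            return false;
    return true;
}

void main()
{
    string ascii = "hello world";
    assert(isAllAscii(ascii));
    assert(ascii.length == 11);        // code units == characters here
    assert(ascii[0 .. 5] == "hello");  // built-in slicing is safe

    string mixed = "na\u00EFve";       // U+00EF takes two UTF-8 code units
    assert(!isAllAscii(mixed));
    assert(mixed.length == 6);         // 5 characters, but 6 code units
    writeln("all checks passed");
}
```

Once a non-ASCII character appears, `.length` and index-based slicing count code units rather than characters, which is the situation the rest of the thread discusses.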
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | unDEFER wrote:
> On Sat, 20 Aug 2011 06:49:33 +0400, Walter Bright <walter at digitalmars.com> wrote:
>
>> There isn't any getting away from understanding that UTF-8 is a multi-byte encoding.
>
> If it is so, then arr.popFront() must break UTF-8 strings ;-)
>
>> If you want to use an encoding with a 1:1 correspondence between indices and characters, use dchar encoding.
>
> For me use in 4 times more memory for ASCII seems too wasteful, sorry.
Exactly - all I'm saying is that if you want the benefits of UTF-8 - low memory consumption *and* high speed processing, you have to be cognizant of its underlying storage scheme. In order to get a higher level of "I don't care how it is stored, I just want to pretend it's an array of Unicode characters", you'll have to give up one or more of efficiency and memory consumption.
>
> Walter, I really very like your creation. It is great. Big thank you
> for it!
> I really believe that there is no bugs, only not documented features ;-)
> I just want to say that the documentation now give enough information.
> std.range or std.array documentation don't say anything about it's
> behaviour on UTF-8 strings.
> I'm already see source codes to know what really does any function.
> Open Source is really great :-)
>
I agree, open source can make up for gaps in the documentation.
|
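The memory versus indexing trade-off in this exchange can be made concrete with a small sketch. `storageBytes` is a hypothetical helper introduced only for this illustration, and the sample strings are assumptions.

```d
import std.stdio : writeln;

/// Hypothetical helper: bytes of storage used by the string's code units.
size_t storageBytes(C)(const(C)[] s)
{
    return s.length * C.sizeof;
}

void main()
{
    // The same ASCII text in the three encodings.
    assert(storageBytes("hello")  == 5);   // UTF-8:  1 byte per code unit
    assert(storageBytes("hello"w) == 10);  // UTF-16: 2 bytes per code unit
    assert(storageBytes("hello"d) == 20);  // UTF-32: 4 bytes per code unit

    // With dchar, index i is code point i.
    dstring d = "na\u00EFve"d;
    assert(d.length == 5);
    assert(d[2] == '\u00EF');

    // With char, index i is code unit i, which may be half a character.
    string u = "na\u00EFve";
    assert(u.length == 6);
    assert(u[2] == 0xC3);                  // first byte of the two-byte U+00EF
    writeln("all checks passed");
}
```

The dstring buys 1:1 indexing at four bytes per code point; the UTF-8 string stays compact but its indices refer to code units.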
August 21, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On Sat, 20 Aug 2011 16:49:43 +0900, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> On Saturday, August 20, 2011 15:50:37 Masahiro Nakagawa wrote:
>> On Sat, 20 Aug 2011 05:22:38 +0900, Jonathan M Davis <jmdavisProg at gmx.com>
>>
>> wrote:
>> > On Friday, August 19, 2011 11:42 Masahiro Nakagawa wrote:
>> >> On Sat, 20 Aug 2011 02:24:41 +0900, unDEFER <undefer at gmail.com>
>> wrote:
>> >>
>> >> [snip]
>> >>
>> >> > The fact which the next code
>> >> > ----
>> >> > writeln( arr.length );
>> >> > arr.popFront();
>> >> > writeln( arr.length );
>> >> > ----
>> >> > prints 9 after 10 for any array but for UTF-8 and UTF-16 strings
>> >> > may
>> >> > print as well 8 or lesser, seems too confusing for me.
>> >>
>> >> You can use std.algorithm.count to count the number of elements.
>> >>
>> >> assert([1,2,3].count() == 3);
>> >> assert("abc".count() == 3);
>> >> assert("???".count() == 3);
>> >
>> > The correct function for getting the number of elements for a range is
>> > std.range.walkLength. count will call its predicate (which defaults to
>> > "true")
>> > on every member of the range. walkLength, on the other hand, will call
>> > the
>> > range's length property if it has one (string and wstring don't have a
>> > length
>> > property as far as ranges are concerned, because they're ranges of
>> > dchar, not
>> > char or wchar) and simply iterates through the range, counting its
>> > elements
>> > otherwise. So, it will be more efficient to call walkLength, and
>> that's
>> > what
>> > it's for. count is for counting the number of elements in the range
>> > which
>> > match its predicate, not for counting the number of elements in the
>> > range.
>>
>> Yes, I know.
>
> Then I don't understand why you were suggesting that he use count rather
> than
> walkLength.
I sometimes use count because walkLength is not a clear name to me (and
count is shorter).
I only remembered walkLength after sending my reply :)
|
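The behaviour discussed in this sub-thread (popFront shrinking `.length` by more than one, and count versus walkLength) can be reproduced with a short sketch. The string used here is an assumption chosen for the example, and `unitsDroppedByPopFront` is a hypothetical helper; it relies on Phobos treating strings as ranges of dchar.

```d
import std.algorithm : count;
import std.range : popFront, walkLength;
import std.stdio : writeln;

/// Hypothetical helper: how many code units a single popFront removes.
size_t unitsDroppedByPopFront(string s)
{
    immutable before = s.length;
    s.popFront();               // s is a local copy of the slice
    return before - s.length;
}

void main()
{
    int[] nums = [1, 2, 3];
    nums.popFront();
    assert(nums.length == 2);                    // ordinary array: shrinks by one

    string s = "\u00E9t\u00E9";                  // 3 code points, 5 UTF-8 code units
    assert(s.length == 5);
    assert(s.walkLength == 3);                   // number of range elements
    assert(s.count() == 3);                      // same result, different intent
    assert(s.count!(c => c == 't')() == 1);      // count is meant for predicates

    assert(unitsDroppedByPopFront(s) == 2);      // U+00E9 is two code units
    assert(unitsDroppedByPopFront("abc") == 1);  // ASCII: one code unit
    writeln("all checks passed");
}
```

So popFront does not break the string: it always removes one whole code point, which is why `.length` can drop by more than one.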
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | On Saturday, August 20, 2011 13:11:44 unDEFER wrote:
> Big thanks, Jonathan!
> You give me very clearly explanations.
> But what you mean by "strings of char and wchar ... have no length
> property" if "string.length" really works? Is it a bug?
All arrays have a length property. It returns the number of elements in the array. The issue is std.range.hasLength, which is what is used with range-based functions in template constraints and static ifs. hasLength is true for all arrays _except_ for arrays of char and wchar. This is because strings are ranges of dchar - of code points - whereas they are arrays of code units, and in UTF-8 and UTF-16, there can be more than one code unit per code point. In the general case, calling length on an array of char or wchar isn't going to give you the number of code points in the array. So, it's normally incorrect to use length with arrays of char and wchar in range-based functions.
string str = "hello world";
assert(str.length == walkLength(str));
This works, because it only uses ASCII characters which all fit in one code unit. Whereas this doesn't
auto str = "??????";
assert(str.length == walkLength(str));
since the characters are more than one code unit each. walkLength uses the length property if hasLength is true, but otherwise it iterates over the whole range and counts how many elements there are. So, in range-based functions, we use walkLength, not length, unless it is a section of code where we know that the range has a length property and that using it directly is correct (based on the template constraint and/or static ifs that the block of code is in).
- Jonathan M Davis
|
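The hasLength dispatch described in this post can be sketched as below. `elementCount` is a hypothetical function written for illustration; it mirrors the described behaviour of walkLength and is not Phobos source. The non-ASCII sample string is an assumption.

```d
import std.range : hasLength, isInputRange, walkLength;
import std.stdio : writeln;

// As far as ranges are concerned, narrow strings have no length.
static assert( hasLength!(int[]));
static assert( hasLength!dstring);
static assert(!hasLength!string);
static assert(!hasLength!wstring);

/// Hypothetical generic helper: number of range elements in r.
size_t elementCount(R)(R r)
if (isInputRange!R)
{
    static if (hasLength!R)
        return r.length;        // O(1): length counts range elements
    else
        return r.walkLength;    // O(n): iterate and count
}

void main()
{
    string s = "\u00E9t\u00E9";                 // 5 code units, 3 code points
    assert(s.length == 5);
    assert(elementCount(s) == 3);
    assert(elementCount("\u00E9t\u00E9"d) == 3);
    assert(elementCount([10, 20, 30]) == 3);

    // For pure ASCII the two numbers happen to agree.
    assert(elementCount("hello world") == "hello world".length);
    writeln("all checks passed");
}
```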
August 21, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | Thank you again, Jonathan. So "have no length property" means std.range.hasLength = false. Now I understand.
On Sun, 21 Aug 2011 03:51:05 +0400, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> On Saturday, August 20, 2011 13:11:44 unDEFER wrote:
>> Big thanks, Jonathan!
>> You give me very clearly explanations.
>> But what you mean by "strings of char and wchar ... have no length
>> property" if "string.length" really works? Is it a bug?
>
> All arrays have a length property. It returns the number of elements in
> the array. The issue is std.range.hasLength, which is what is used with
> range-based functions in template constraints and static ifs. hasLength
> is true for all arrays _except_ for arrays of char and wchar. This is
> because strings are ranges of dchar - of code points - whereas they are
> arrays of code units, and in UTF-8 and UTF-16, there can be more than
> one code unit per code point. In the general case, calling length on an
> array of char or wchar isn't going to give you the number of code
> points in the array. So, it's normally
> incorrect to use length with arrays of char and wchar in range-based
> functions.
>
> - Jonathan M Davis
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
|
August 20, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to unDEFER | On Sunday, August 21, 2011 04:54:00 unDEFER wrote:
> Thank you again, Jonathan. So "have no length property" means std.range.hasLength = false. Now I understand.
Yeah. I probably should have been clearer about it in my original post. They have a length property, but as far as ranges are concerned, they don't. hasLength is the mechanism that that's checked with though, so perhaps I should use that when explaining. The same goes for isRandomAccessRange and hasSlicing. They're false for arrays of char and wchar in spite of the fact that they're arrays and thus have random access and are sliceable, because they're random access and sliceable only on code units, not code points.
- Jonathan M Davis
|
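The range traits mentioned here can be checked at compile time, and the reason behind them shown at run time. This is a sketch under the assumption of Phobos's usual treatment of narrow strings as ranges of dchar; `isValidUtf8` is a hypothetical helper built on std.utf.validate, and the sample string is made up.

```d
import std.range : ElementType, hasSlicing, isBidirectionalRange, isRandomAccessRange;
import std.stdio : writeln;
import std.utf : UTFException, validate;

// A string is a bidirectional range of dchar, not a random-access one.
static assert(is(ElementType!string == dchar));
static assert( isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasSlicing!string);

// A dstring has one code unit per code point, so both traits hold.
static assert( isRandomAccessRange!dstring);
static assert( hasSlicing!dstring);

/// Hypothetical helper: true when s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);            // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    string s = "\u00E9t\u00E9";        // code units: 2 + 1 + 2

    // The array itself can still be sliced, but only in code units.
    assert(isValidUtf8(s[0 .. 2]));    // the whole first code point
    assert(!isValidUtf8(s[0 .. 1]));   // cut in the middle of a code point
    assert(s[2 .. 3] == "t");
    writeln("all checks passed");
}
```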
August 21, 2011 [phobos] UTF-8 string slicing | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On 08/19/2011 08:40 PM, Jonathan M Davis wrote:
> On Friday, August 19, 2011 19:58:34 Benjamin Shropshire wrote:
>> On 08/18/2011 02:21 AM, unDEFER wrote:
>>> Hello!
>>>
>>> D language specification says that it supports UTF-8 strings, but I
>>> can't
>>> find how to slice UTF-8 string by character index, not by bytes numbers.
>>> Why there is no simple slice function in std.utf like attached code?
>> BTW: your code is flawed. Feed it some of the stuff near the end of this post and it will fail:
>>
>> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>>
>> tl;dr; your code doesn't slice on characters but something called (IIRC)
>> code points. If you start worrying about diacritics (and many end users
>> will want you to)
>> you need to do a bunch more processing.
>>
>> http://en.wikipedia.org/wiki/Diacritic
> His code works as well as slicing a dstring does - save for the efficiency issues. There is no way in Phobos at present to deal with graphemes. All of the string processing in Phobos deals with code points. For the most part, this works great, but it is true that it isn't complete. I expect that we'll get grapheme support eventually (I believe that Dmitry has done some work on a grapheme range for the updates that he's been doing to std.regex for GSoC, so we may get it from there). But for now, none of the string processing in D worries about graphemes - just code points.
My thought on that subject is: I can see good reason to index on proper characters (get the 4th char in the word), good reason to index to a character (or sometimes a code point) near some byte position, and there are clearly good reasons to iterate through code points, but I don't see much value to be had from asking for a random Nth code point that can't be had via something that has fewer problems and/or is cheaper.
> - Jonathan M Davis
|
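For readers following the code point versus grapheme distinction, here is a rough sketch. It is not the attached code from the original post (which is not shown in the thread): `sliceCodePoints` and `graphemeCount` are hypothetical helpers, std.utf.toUTFindex does the index translation, and std.uni.byGrapheme comes from Phobos releases later than this discussion.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : toUTFindex;

/// Hypothetical helper: slice by code point index instead of code unit index.
/// Each toUTFindex call walks the string from the start, so this is O(n).
string sliceCodePoints(string s, size_t from, size_t to)
{
    immutable lo = s.toUTFindex(from);
    immutable hi = s.toUTFindex(to);
    return s[lo .. hi];
}

/// Hypothetical helper: number of user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "\u00E9t\u00E9 fini";           // 8 code points, 10 code units
    assert(sliceCodePoints(s, 0, 3) == "\u00E9t\u00E9");
    assert(sliceCodePoints(s, 4, 8) == "fini");

    // 'e' followed by a combining acute accent: one visible character.
    string g = "e\u0301";
    assert(g.length == 3);                     // code units
    assert(g.walkLength == 2);                 // code points
    assert(graphemeCount(g) == 1);             // graphemes

    // Slicing on code points can still separate the accent from its base.
    assert(sliceCodePoints(g, 0, 1) == "e");
    writeln("all checks passed");
}
```

The last assertion is the failure mode Benjamin points at: a code-point slice is valid UTF-8 but may still split what the user sees as one character.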
Copyright © 1999-2021 by the D Language Foundation