[phobos] UTF-8 string slicing
August 18, 2011
Hello!

The D language specification says that it supports UTF-8 strings, but I can't find how to slice a UTF-8 string by character index rather than by byte offset. Why is there no simple slice function in std.utf like the attached code?

Thank you in advance.

-- 
Nikolay Krivchenkov aka unDEFER
registered Linux user #360474
Don't worry, I can read OpenOffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slice-utf-8.d
Type: application/octet-stream
Size: 895 bytes
Desc: not available
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20110818/0cab1096/attachment.obj>
August 18, 2011
On Thursday, August 18, 2011 13:21:29 unDEFER wrote:
> Hello!
> 
> The D language specification says that it supports UTF-8 strings, but I can't find how to slice a UTF-8 string by character index rather than by byte offset. Why is there no simple slice function in std.utf like the attached code?
> 
> Thank you in advance.

Hmmm. Such a function isn't entirely a bad idea, but it also makes me a bit nervous. Slicing is efficient. The slice function that you suggest is not. I mean, it's efficient enough for what it's doing, but it's not O(1) like slicing is, so having a slice function could be a bit misleading.

Once drop has been merged in, you'll be able to do this

auto s = takeExactly(drop(str, firstIndex), lastIndex - firstIndex);

to get the same effect. It may be worth adding such a function though. Certainly

auto s = slice(firstIndex, lastIndex);

is cleaner. If we add it though, then we should probably give it a different name. Maybe sliceByElementType? That does seem a bit long though, if accurate. We'd probably put it in std.range though rather than std.utf, since it could be useful for any range which isn't actually sliceable. And then there's the question of whether it would be better to make it lazy. It would make it so that it wasn't actually a string anymore, but it would make it more efficient for all of the cases where you don't actually end up using the whole slice.

You can make a pull request for it if you want to, and the best way to handle it - as well as whether we actually want such a function - can be discussed in the pull request. I do think that some thought is going to have to go into what behavior we really want such a function to have though (as well as the best name for it).

- Jonathan M Davis
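
For reference, the drop/takeExactly combination above could be wrapped as a small helper along these lines. This is only a sketch: sliceByCodePoint is a made-up name, not a Phobos function, and it assumes a Phobos version that already provides std.range.drop. It copies the result into a new string, so it is O(n) in time and allocates, unlike a real slice.

```d
import std.array : array;
import std.range : drop, takeExactly;
import std.utf : toUTF8;

// Hypothetical helper, not part of Phobos: returns the code points in
// [first, last) of str as a new string. Assumes last >= first and that
// str holds at least `last` code points.
string sliceByCodePoint(string str, size_t first, size_t last)
{
    // drop and takeExactly walk the string through front/popFront, so the
    // indices count decoded code points (dchar), not UTF-8 code units.
    auto lazySlice = takeExactly(drop(str, first), last - first);

    // array() collects the dchars; toUTF8 re-encodes them as a string.
    return toUTF8(array(lazySlice));
}

void main()
{
    string str = "привет, мир";   // 11 code points, 20 UTF-8 code units
    assert(sliceByCodePoint(str, 0, 6) == "привет");
    assert(sliceByCodePoint(str, 8, 11) == "мир");
}
```

If the caller only iterates over the result, the lazySlice range could be returned directly and the copy avoided, which is the lazy variant mentioned above.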
August 18, 2011
On Aug 18, 2011, at 7:53 PM, Jonathan M Davis wrote:

> On Thursday, August 18, 2011 13:21:29 unDEFER wrote:
>> Hello!
>> 
>> The D language specification says that it supports UTF-8 strings, but I can't find how to slice a UTF-8 string by character index rather than by byte offset. Why is there no simple slice function in std.utf like the attached code?
>> 
>> Thank you in advance.
> 
> Hmmm. Such a function isn't entirely a bad idea, but it also makes me a bit nervous. Slicing is efficient. The slice function that you suggest is not. I mean, it's efficient enough for what it's doing, but it's not O(1) like slicing is, so having a slice function could be a bit misleading.

I need to do this from time to time, but I generally just do something like:

buf[0 .. buf.toUCSindex(n)]

A shorthand might be nice though, I suppose.

August 19, 2011
On Fri, 19 Aug 2011 06:53:37 +0400, Jonathan M Davis <jmdavisProg at gmx.com> wrote:

> Hmmm. Such a function isn't entirely a bad idea, but it also makes me a bit nervous. Slicing is efficient. The slice function that you suggest is not. I mean, it's efficient enough for what it's doing, but it's not O(1) like slicing is, so having a slice function could be a bit misleading.

I know that it is not efficient, but this raises the question of why D decided not to support 8-bit encodings. Only those make operations like this efficient.

> Once drop has been merged in, you'll be able to do this
> auto s = takeExactly(drop(str, firstIndex), lastIndex - firstIndex);
> to get the same effect. It may be worth adding such a function though.

I'm sorry, but it looks like there is no "drop()" function.
Anyway, thank you. I don't really understand how takeExactly works, but it
works. For newbies it is really not obvious that std.range works fine with
UTF-8 strings.

> Certainly
> auto s = slice(firstIndex, lastIndex);
> is cleaner. If we add it though, then we should probably give it a
> different name. Maybe sliceByElementType? That does seem a bit long
> though, if accurate.

In many other languages this function is named "subString".

> We'd probably put it in std.range though rather than std.utf, since it could be useful for any range which isn't actually sliceable. And then there's the question of whether it would be better to make it lazy. It would make it so that it wasn't actually a string anymore, but it would make it more efficient for all of the cases where you don't actually end up using the whole slice.
>
> You can make a pull request for it if you want to, and the best way to handle it - as well as whether we actually want such a function - can be discussed in the pull request. I do think that some thought is going to have to go into what behavior we really want such a function to have though (as well as the best name for it).

I'm not familiar with Git, but I'll try to come up with something.

August 19, 2011
On Fri, 19 Aug 2011 07:06:53 +0400, Sean Kelly <sean at invisibleduck.org> wrote:

> I need to do this from time to time, but I generally just do something like:
>
> buf[0 .. buf.toUCSindex(n)]
>
> A shorthand might be nice though, I suppose.

Hm.. I don't know how it works for you, but for me this code doesn't work at all.

August 19, 2011

Sean Kelly wrote:
>
> I need to do this from time to time, but I generally just do something like:
>
> buf[0 .. buf.toUCSindex(n)]
>
> A shorthand might be nice though, I suppose.
>
> 

Somewhat surprisingly, such a function is rarely needed (I've never needed it in working with UTF8) and so I don't think a special syntax for it is justified.
August 19, 2011
It works because front and popFront actually decode Unicode characters. So takeExactly operates on 32-bit dchars, the ElementType of all strings.

On Fri, 19 Aug 2011 12:07:53 +0200, unDEFER <undefer at gmail.com> wrote:

> On Fri, 19 Aug 2011 06:53:37 +0400, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
>
>> Hmmm. Such a function isn't entirely a bad idea, but it also makes me a bit nervous. Slicing is efficient. The slice function that you suggest is not. I mean, it's efficient enough for what it's doing, but it's not O(1) like slicing is, so having a slice function could be a bit misleading.
>
> I know that it is not efficient, but this raises the question of why D decided not to support 8-bit encodings. Only those make operations like this efficient.
>
>> Once drop has been merged in, you'll be able to do this
>> auto s = takeExactly(drop(str, firstIndex), lastIndex - firstIndex);
>> to get the same effect. It may be worth adding such a function though.
>
> I'm sorry, but it looks like there is no "drop()" function.
> Anyway, thank you. I don't really understand how takeExactly works, but
> it works. For newbies it is really not obvious that std.range works fine
> with UTF-8 strings.
>
>> Certainly
>> auto s = slice(firstIndex, lastIndex);
>> is cleaner. If we add it though, then we should probably give it a
>> different name. Maybe sliceByElementType? That does seem a bit long
>> though, if accurate.
>
> In many other languages this function is named "subString".
>
>> We'd probably put it in std.range though rather than std.utf, since it could be useful for any range which isn't actually sliceable. And then there's the question of whether it would be better to make it lazy. It would make it so that it wasn't actually a string anymore, but it would make it more efficient for all of the cases where you don't actually end up using the whole slice.
>>
>> You can make a pull request for it if you want to, and the best way to handle it - as well as whether we actually want such a function - can be discussed in the pull request. I do think that some thought is going to have to go into what behavior we really want such a function to have though (as well as the best name for it).
>
> I'm not familiar with Git, but I'll try to come up with something.
>


August 19, 2011
The few times I used it were for trimming a buffer to some length for display purposes.

On Aug 19, 2011, at 5:41 AM, Walter Bright <walter at digitalmars.com> wrote:

> 
> 
> Sean Kelly wrote:
>> 
>> I need to do this from time to time, but I generally just do something like:
>> 
>> buf[0 .. buf.toUCSindex(n)]
>> 
>> A shorthand might be nice though, I suppose.
>> 
>> 
> 
> Somewhat surprisingly, such a function is rarely needed (I've never needed it in working with UTF8) and so I don't think a special syntax for it is justified.
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
August 19, 2011
I agree. A special syntax is unnecessary.
I usually work with Japanese text, but slicing UTF-8 strings has not been
a problem.
When slicing by character really is necessary, it is effective to use
dstring (UTF-32).

When I slice a UTF-8 string that includes multi-byte characters,
the delimiter is an ASCII character in most cases.
Otherwise, I don't think a special syntax is needed, because that is
fairly specialized processing (e.g. regex).
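
A minimal sketch of the ASCII-delimiter case, with a made-up key=value line as sample data. In UTF-8, ASCII bytes never occur inside a multi-byte sequence, so the byte index of an ASCII delimiter is always a valid place to cut with an ordinary O(1) slice.

```d
import std.string : indexOf;

void main()
{
    // Illustrative input: multi-byte text split on an ASCII '='.
    string line = "名前=山田太郎";

    // indexOf returns a code-unit (byte) index, or -1 if not found.
    auto pos = line.indexOf('=');
    assert(pos != -1);
    assert(pos == 6);                    // "名前" occupies 2 * 3 bytes

    // Plain byte-index slices: no decoding pass and no copy.
    string key = line[0 .. pos];
    string value = line[pos + 1 .. $];

    assert(key == "名前");
    assert(value == "山田太郎");
}
```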

2011/8/19 Walter Bright <walter at digitalmars.com>:
>
>
> Sean Kelly wrote:
>>
>> I need to do this from time to time, but I generally just do something like:
>>
>> buf[0 .. buf.toUCSindex(n)]
>>
>> A shorthand might be nice though, I suppose.
>>
>>
>
> Somewhat surprisingly, such a function is rarely needed (I've never needed it in working with UTF8) and so I don't think a special syntax for it is justified.
August 19, 2011
Maybe so.
We have three ways to slice UTF-8 by character index:

string substr = str[str.toUTFindex(from) .. str.toUTFindex(to)];
// toUTFindex, not toUCSindex as Sean Kelly wrote
or
string substr = toUTF8(array(takeExactly(drop(str, from), to - from)));
or
string substr = toUTF8(toUTF32(str)[from..to]);

But in any case the documentation should be clearer on this point. I have
been studying the D documentation for 3 days, and I still don't understand
from it what UTF-8 support means.
Right now I can't tell apart the functions that work on strings at the
level of UTF-8 characters from those that work at the level of bytes.
The fact that the following code
----
writeln( arr.length );
arr.popFront();
writeln( arr.length );
----
prints 10 and then 9 for any other array, but for UTF-8 and UTF-16 strings may print 8 or less the second time, seems too confusing to me.
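
Put together as a complete program, the three approaches might look like the sketch below. The sample string and index values are my own, and it assumes a Phobos that provides std.range.drop. All three give the same result but differ in cost: the first only scans the string and returns a true slice of the original, while the other two allocate.

```d
import std.array : array;
import std.range : drop, takeExactly;
import std.utf : toUTF8, toUTF32, toUTFindex;

void main()
{
    string str = "Привет, мир!";
    size_t from = 8, to = 11;     // code point indices, half-open [from, to)

    // 1. Convert code point indices to code unit indices, then slice.
    //    Two linear scans, no allocation; the result aliases str.
    string a = str[str.toUTFindex(from) .. str.toUTFindex(to)];

    // 2. Walk the string lazily by code point, then re-encode.
    //    Allocates a dchar[] for the slice and a new string.
    string b = toUTF8(array(takeExactly(drop(str, from), to - from)));

    // 3. Convert the whole string to UTF-32, slice in O(1), re-encode.
    //    Allocates a dstring for the entire input.
    string c = toUTF8(toUTF32(str)[from .. to]);

    assert(a == "мир");
    assert(b == "мир");
    assert(c == "мир");
}
```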

On Fri, 19 Aug 2011 18:38:21 +0400, SHOO <zan77137 at nifty.com> wrote:

> I agree. A special syntax is unnecessary.
> I usually work with Japanese text, but slicing UTF-8 strings has not been
> a problem.
> When slicing by character really is necessary, it is effective to use
> dstring (UTF-32).
>
> When I slice a UTF-8 string that includes multi-byte characters,
> the delimiter is an ASCII character in most cases.
> Otherwise, I don't think a special syntax is needed, because that is
> fairly specialized processing (e.g. regex).
>
> 2011/8/19 Walter Bright <walter at digitalmars.com>:
>>
>>
>> Sean Kelly wrote:
>>>
>>> I need to do this from time to time, but I generally just do something like:
>>>
>>> buf[0 .. buf.toUCSindex(n)]
>>>
>>> A shorthand might be nice though, I suppose.
>>>
>>>
>>
>> Somewhat surprisingly, such a function is rarely needed (I've never needed it in working with UTF8) and so I don't think a special syntax for it is justified.
