Python-like slicing and handling UTF-8 strings as a bonus

Dec 29, 2012

Vladimir Panteleev

Dec 29, 2012

bearophile

Dec 29, 2012

Peter Alexander

Dec 29, 2012

bearophile

Dec 30, 2012

monarch_dodra

December 29, 2012

Posted by FG

Slices are great but not really what I had expected, coming from Python.
I've seen code like s[a..$-b] used without checking the values, just to end up with a Range violation. But there are 3 constraints to check here:
	a >= 0 && a + b <= s.length && b >= 0

That's way too much coding for a simple program/script that shortens a string, before it prints it on a screen. If I can't write s[0..80] without fear, then let there at least be a function that does it like Python would.

Additionally, as strings are UTF-8-encoded, I'd like such a function to give me proper substrings, without multibyte characters cut in the middle, where s[0..80] would mean 80 characters on the screen and not 80 bytes.

I would envision it being part of std.string eventually.
Forgive me if such a function already exists -- I couldn't find it.
I also still don't speak D too well, so don't laugh. :)




import std.array, std.range, std.stdio;


auto getSlice(T)(T[] s, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max)
pure @safe
{
    bool start_from_back, end_from_back;
    size_t full_len = s.length;
    ptrdiff_t len;
    if (full_len > ptrdiff_t.max)
        len = ptrdiff_t.max;
    else len = cast(ptrdiff_t) full_len;
    if (end < 0)
    {
        end_from_back = true;
        end += len;
    }
    if (end > len) end = len;
    if (start < 0)
    {
        if (0 - start >= len)
            start = 0;
        else
        {
            start += len;
            start_from_back = true;
        }
    }
    if (start < 0) start = 0;
    if (start > end || start >= len || end <= 0)
        return s[0..0];

    static if(is(T == char) || is(T == immutable(char)) ||
            is(T : wchar) || is(T : immutable(wchar)))
    {
        ptrdiff_t real_start = -1, real_end = -1, loop, last_pos;
        if (!start_from_back || !end_from_back)
        {
            foreach (ptrdiff_t i, dchar c; s)
            {
                if (!start_from_back && loop >= start && real_start < 0)
                    real_start = i;
                if (!end_from_back && loop >= end && real_end < 0)
                    real_end = i;
                if ((start_from_back || real_start > -1) &&
                        (end_from_back || real_end > -1 || end == len))
                    break;
                loop++;
            }
        }
        start -= len;
        end -= len;
        loop = -1;
        if (start_from_back || end_from_back)
        {
            foreach_reverse (ptrdiff_t i, dchar c; s)
            {
                if (start_from_back && loop <= start && real_start < 0)
                    real_start = i;
                if (end_from_back && loop <= end && real_end < 0)
                    real_end = i;
                if ((!start_from_back || real_start > -1) &&
                        (!end_from_back || real_end > -1))
                    break;
                loop--;
            }
        }
        if (real_end < 0) real_end = (end_from_back ? 0 : len);
        if (real_start < 0) real_start = (start_from_back ? 0 : len);
        if (real_start > real_end) real_start = real_end = 0;
        return s[real_start..real_end];
    }
    else return s[start..end];
}

unittest {
    string s = "okrągły stół";
    dstring d = "okrągły stół"d;
    auto t = [0, 1, 2, 3, 4];
    assert(t.getSlice(0, -1) == [0, 1, 2, 3]);
    assert(t.getSlice(1, -2) == [1, 2]);
    assert(t.getSlice(-4, -2) == [1, 2]);
    assert(t.getSlice(-5, 7) == [0, 1, 2, 3, 4]);
    assert(s.getSlice(0, 0) == "");
    assert(s.getSlice(0, 1) == "o");
    assert(s.getSlice(0) == s);
    assert(s.getSlice(8) == "stół");
    assert(s.getSlice(8, -1) == "stó");
    assert(s.getSlice(8, -2) == "st");
    assert(s.getSlice(8, -4) == "");
    assert(s.getSlice(10, 11) == "ó");
    assert(s.getSlice(10, -1) == "ó");
    assert(s.getSlice(10, 12) == "ół");
    assert(s.getSlice(11, 12) == "ł");
    assert(s.getSlice(11, 15) == "ł");
    assert(d.getSlice(0, 0) == ""d);
    assert(d.getSlice(0, 1) == "o"d);
    assert(d.getSlice(0) == d);
    assert(d.getSlice(8) == "stół"d);
    assert(d.getSlice(8, -1) == "stó"d);
    assert(d.getSlice(8, -2) == "st"d);
    assert(d.getSlice(8, -4) == ""d);
    assert(d.getSlice(10, 11) == "ó"d);
    assert(d.getSlice(10, -1) == "ó"d);
    assert(d.getSlice(10, 12) == "ół"d);
    assert(d.getSlice(11, 12) == "ł"d);
    assert(d.getSlice(11, 15) == "ł"d);
}
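
The index handling at the top of getSlice (negative indices counted from the end, everything clamped to the array bounds) can be looked at in isolation. Below is a minimal sketch of just that part for plain arrays; `pySlice` and `norm` are hypothetical names, not anything from Phobos, and the indices count elements (for a string that means code units, not characters).

```d
/// Hypothetical helper: Python-style slice for arrays, counted in elements.
T[] pySlice(T)(T[] a, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max)
{
    immutable len = cast(ptrdiff_t) a.length;

    // Map a possibly negative index into the range [0, n].
    static ptrdiff_t norm(ptrdiff_t i, ptrdiff_t n)
    {
        if (i < 0) i += n;                      // negative: count from the end
        return i < 0 ? 0 : (i > n ? n : i);     // clamp to the valid range
    }

    immutable lo = norm(start, len);
    immutable hi = norm(end, len);
    return lo < hi ? a[lo .. hi] : a[0 .. 0];   // empty slice instead of a Range violation
}
```

With this, `[0, 1, 2, 3, 4].pySlice(-4, -2)` gives `[1, 2]`, and an out-of-range end such as `pySlice(-5, 7)` simply returns the whole array.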

On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
> Slices are great but not really what I had expected, coming from Python.
> I've seen code like s[a..$-b] used without checking the values, just to end up with a Range violation. But there are 3 constraints to check here:
>     a >= 0 && a + b <= s.length && b >= 0
>
> That's way too much coding for a simple program/script that shortens a string, before it prints it on a screen. If I can't write s[0..80] without fear, then let there at least be a function that does it like Python would.

Why?

> Additionally, as strings are UTF-8-encoded, I'd like such a function to give me proper substrings, without multibyte characters cut in the middle, where s[0..80] would mean 80 characters on the screen and not 80 bytes.

This is a common fallacy when dealing with Unicode. Please see the linked and the following points:

http://utf8everywhere.org/#myth.utf32.o1
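
To make the point concrete: counting code points is still not the same as counting characters on the screen, because one visible character can be built from several code points (a base letter plus a combining accent, for instance). The sketch below compares the three counts. It assumes a Phobos version that ships std.uni.byGrapheme (not available when this thread was written); the three function names are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes) in s.
size_t codeUnits(string s) { return s.length; }

/// Number of code points; walkLength decodes the string as it walks it.
size_t codePoints(string s) { return s.walkLength; }

/// Number of grapheme clusters, which is closer to "characters on screen".
size_t graphemes(string s) { return s.byGrapheme.walkLength; }
```

For `"e\u0301"` (an "e" followed by a combining acute accent) the three functions return 3, 2 and 1 respectively, so a slice of "80 code points" can still be shorter than 80 visible characters or cut an accent off its base letter.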

On 2012-12-29 23:35, Vladimir Panteleev wrote:
> On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
>> Slices are great but not really what I had expected, coming from Python.
>> I've seen code like s[a..$-b] used without checking the values, just to end up
>> with a Range violation. But there are 3 constraints to check here:
>>     a >= 0 && a + b <= s.length && b >= 0
>>
>> That's way too much coding for a simple program/script that shortens a string,
>> before it prints it on a screen. If I can't write s[0..80] without fear, then
>> let there at least be a function that does it like Python would.
>
> Why?

Probably because I like concise code. I always prefer:

    if (A) print(getMessage().getSlice(0, 100));

to writing something like this:

    auto message = getMessage();
    if (A) print(message.length > 100 ? message[0..100] : message);

>> Additionally, as strings are UTF-8-encoded, I'd like such a function to give
>> me proper substrings, without multibyte characters cut in the middle, where
>> s[0..80] would mean 80 characters on the screen and not 80 bytes.
>
> This is a common fallacy when dealing with Unicode. Please see the linked and
> the following points:
>
> http://utf8everywhere.org/#myth.utf32.o1

True. I didn't think about all the languages out there. Just some common European ones.

On 2012-12-29 23:55, FG wrote:
> Probably because I like concise code. I always prefer:
>     if (A) print(getMessage().getSlice(0, 100));
>
> to writing something like this:
>     auto message = getMessage();
>     if (A) print(message.length > 100 ? message[0..100] : message);
>

Actually, when I look at this, it can be a one-liner after all. :)

    if (A) print(getMessage()[0..($>100?100:$)]);

Didn't expect this to work.

On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
> Forgive me if such a function already exists -- I couldn't find it.

std.range has drop and take, which work on code points, not code units. They also handle over-dropping or over-taking gracefully. For example:

    string s = "okrągły stół";
    writeln(s.drop(8).take(3));     // "stó"
    writeln(s.drop(8).take(100));   // "stół"
    writeln(s.drop(100).take(100)); // ""

http://dpaste.dzfl.pl/2f8ebf49

It doesn't support negative indexing.

Generally speaking though, the vast majority of user code should never need to index into a Unicode string.
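
Note that take returns a lazy wrapper rather than a string. If an actual slice of the original string is wanted, one possible approach is to use drop twice and compare the remaining lengths. This is only a sketch; `codePointSlice` is a hypothetical name, not a Phobos function.

```d
import std.range : drop;

/// Hypothetical helper: `count` code points starting at code point `start`,
/// returned as a slice of s (no copying). Out-of-range values are clamped.
string codePointSlice(string s, size_t start, size_t count)
{
    auto rest = s.drop(start);     // skip `start` code points, stopping at the end
    auto tail = rest.drop(count);  // what is left after `count` more code points
    // The wanted part is `rest` minus `tail`, measured in code units.
    return rest[0 .. rest.length - tail.length];
}
```

For `s = "okrągły stół"`, `codePointSlice(s, 8, 3)` gives "stó" and `codePointSlice(s, 100, 100)` gives an empty string, matching the drop/take results above.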

FG:

> to writing something like this:
>     auto message = getMessage();
>     if (A) print(message.length > 100 ? message[0..100] : message);

In std.algorithm there is min(), that helps a little:

    if (A) print(message[0 .. min($, 100)]);

Bye,
bearophile
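
Wrapped in a small function, the min() idiom might look like the sketch below. `prefix` is a hypothetical name. It counts array elements, so on a string it still cuts by code units and can split a multibyte character.

```d
import std.algorithm : min;

/// Hypothetical helper: at most the first n elements of a, never out of range.
T[] prefix(T)(T[] a, size_t n)
{
    return a[0 .. min($, n)];  // $ is a.length inside the slice expression
}
```

Usage would then be `print(getMessage().prefix(100));`.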

Peter Alexander:

> Generally speaking though, the vast majority of user code should never need to index into a Unicode string.

Right, 90% of the code doesn't need to slice strings (and generally strings are Unicode). But the other 90% of the code needs to slice things...

Bye,
bearophile

On 2012-12-30 00:03, Peter Alexander wrote:
> On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
>> Forgive me if such a function already exists -- I couldn't find it.
>
> std.range have drop and take, which work on code points, not code units. They
> also handle over-dropping or over-taking gracefully. For example:
>
>     string s = "okrągły stół";
>     writeln(s.drop(8).take(3));     // "stó"
>     writeln(s.drop(8).take(100));   // "stół"
>     writeln(s.drop(100).take(100)); // ""
>

Ah, so this is the way of doing it. Thanks.

> It doesn't support negative indexing.

At least dropping off the back is also possible, e.g. for s[2..$-5]:

    writeln(s.retro.drop(5).retro.drop(2)); // "rągły"

(or with dropBack, without retro, if available)

I have no idea how to do s[$-4..$-2] though.

On Sunday, 30 December 2012 at 00:02:17 UTC, FG wrote:
> On 2012-12-30 00:03, Peter Alexander wrote:
>> On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
>>> Forgive me if such a function already exists -- I couldn't find it.
>>
>> std.range have drop and take, which work on code points, not code units. They
>> also handle over-dropping or over-taking gracefully. For example:
>>
>>     string s = "okrągły stół";
>>     writeln(s.drop(8).take(3));     // "stó"
>>     writeln(s.drop(8).take(100));   // "stół"
>>     writeln(s.drop(100).take(100)); // ""
>>
>
> Ah, so this is the way of doing it. Thanks.
>
>> It doesn't support negative indexing.
>
> At least dropping off the back is also possible s[2..$-5]:
>
>     writeln(s.retro.drop(5).retro.drop(2)); // "rągły"
>
> (or with dropBack, without retro, if available)

dropBack is available IFF retro is available. (AFAIK)

> I have no idea how to do s[$-4..$-2] though.

But as a general rule, making a range out of the first (or last) elements of a non-RA range is a limitation of how ranges can "only shrink". Strings are a special case: a non-RA, non-sliceable range that you can nevertheless index and slice...

Anyways, you can always get creative with length:

//----
auto s = "hello world";
auto sub = s[s.dropBack(4).length .. s.dropBack(2).length]; // "or"
//----

In this particular example, it is a bit suboptimal, but quite frankly, I'd assume readability trumps performance for this kind of code (and is what I'd use in my end code).

One last thing: keep in mind "drop/take" are linear operations. If you are handling Unicode, then everything is linear anyways, so I'm not saying these functions are slow or anything, just don't forget they aren't the O(1) functions you'd get with ASCII.
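
The length trick can be turned into a small reusable function. The following is a sketch only: `sliceFromBack` is a hypothetical name, and it assumes a Phobos version where std.range.dropBack is available.

```d
import std.range : dropBack;

/// Hypothetical helper: the equivalent of s[$-a .. $-b], with a and b
/// counted in code points from the end of the string.
string sliceFromBack(string s, size_t a, size_t b)
{
    // After dropping n code points from the back, the remaining length
    // is the code-unit index where those n code points begin.
    immutable lo = s.dropBack(a).length;
    immutable hi = s.dropBack(b).length;
    return lo <= hi ? s[lo .. hi] : s[0 .. 0];  // empty when the bounds cross
}
```

`sliceFromBack("okrągły stół", 4, 2)` yields "st". Both dropBack calls walk the string from the end, so the cost is linear in a and b, in line with the remark above about drop and take.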
