Thread overview
A better way to write this function? (style question)
Dec 30, 2013
Thomas Gann
Dec 30, 2013
John Colvin
Dec 30, 2013
John Colvin
Dec 30, 2013
Brad Anderson
Dec 30, 2013
John Colvin
Dec 31, 2013
Jacob Carlborg
Dec 30, 2013
Brad Anderson
Dec 30, 2013
bearophile
Dec 31, 2013
Thomas Gann
December 30, 2013
I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:

string[] split_sentence(string input)
{
    string line;

    foreach(c; input)
    {
        if(c == '\n' || c == '\r')
            line ~= ' ';

        else
            line ~= c.toLower();
    }

    return line.splitter(' ').filter!(a => a.length).array;
}

Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
December 30, 2013
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>
> string[] split_sentence(string input)
> {
>     string line;
>
>     foreach(c; input)
>     {
>         if(c == '\n' || c == '\r')
>             line ~= ' ';
>
>         else
>             line ~= c.toLower();
>     }
>
>     return line.splitter(' ').filter!(a => a.length).array;
> }
>
> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?


A few points:

by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.

using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'

In my opinion the whole API would be better as range-based:

auto splitSentence(R)(R input)
    if(isInputRange!R)
{
    return input
           .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
           .splitter!(' ')
           .filter!(a => !(a.empty));
}
December 30, 2013
On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
> On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
>> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>>
>> string[] split_sentence(string input)
>> {
>>    string line;
>>
>>    foreach(c; input)
>>    {
>>        if(c == '\n' || c == '\r')
>>            line ~= ' ';
>>
>>        else
>>            line ~= c.toLower();
>>    }
>>
>>    return line.splitter(' ').filter!(a => a.length).array;
>> }
>>
>> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
>
>
> A few points:
>
> by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.
>
> using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'
>
> In my opinion the whole API would be better as range-based:
>
> auto splitSentence(R)(R input)
>     if(isInputRange!R)
> {
>     return input
>            .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
>            .splitter!(' ')
>            .filter!(a => !(a.empty));
> }

sorry, ignore that attempt, it's woefully broken...
December 30, 2013
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>
> string[] split_sentence(string input)
> {
>     string line;
>
>     foreach(c; input)
>     {
>         if(c == '\n' || c == '\r')
>             line ~= ' ';
>
>         else
>             line ~= c.toLower();
>     }
>
>     return line.splitter(' ').filter!(a => a.length).array;
> }
>
> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?

Not a huge improvement and it could be a lot faster, no doubt:

    string[] split_sentence(string input)
    {
        import std.regex;
        auto split_re = ctRegex!(r"[\n\r ]");
        return input.split(split_re)
                    .map!toLower
                    .filter!(a => !a.empty)
                    .array();
    }

You could return the result of filter instead of an array to make it lazy and avoid an allocation if the caller is able to use it in that form. toLower had some really slow and incorrect behavior but I think that was fixed in 2.064.  It still might be ASCII only though.  I believe std.uni has a toLower that is correct for all of unicode.
December 30, 2013
On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
> On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
>> On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
>>> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>>>
>>> string[] split_sentence(string input)
>>> {
>>>   string line;
>>>
>>>   foreach(c; input)
>>>   {
>>>       if(c == '\n' || c == '\r')
>>>           line ~= ' ';
>>>
>>>       else
>>>           line ~= c.toLower();
>>>   }
>>>
>>>   return line.splitter(' ').filter!(a => a.length).array;
>>> }
>>>
>>> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
>>
>>
>> A few points:
>>
>> by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.
>>
>> using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'
>>
>> In my opinion the whole API would be better as range-based:
>>
>> auto splitSentence(R)(R input)
>>    if(isInputRange!R)
>> {
>>    return input
>>           .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
>>           .splitter!(' ')
>>           .filter!(a => !(a.empty));
>> }
>
> sorry, ignore that attempt, it's woefully broken...

Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you?  Because I did and thought I was going insane until realized DPaste was broken.
December 30, 2013
On Monday, 30 December 2013 at 22:38:43 UTC, Brad Anderson wrote:
> On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
>> On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
>>> On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
>>>> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>>>>
>>>> string[] split_sentence(string input)
>>>> {
>>>>  string line;
>>>>
>>>>  foreach(c; input)
>>>>  {
>>>>      if(c == '\n' || c == '\r')
>>>>          line ~= ' ';
>>>>
>>>>      else
>>>>          line ~= c.toLower();
>>>>  }
>>>>
>>>>  return line.splitter(' ').filter!(a => a.length).array;
>>>> }
>>>>
>>>> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
>>>
>>>
>>> A few points:
>>>
>>> by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.
>>>
>>> using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'
>>>
>>> In my opinion the whole API would be better as range-based:
>>>
>>> auto splitSentence(R)(R input)
>>>   if(isInputRange!R)
>>> {
>>>   return input
>>>          .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
>>>          .splitter!(' ')
>>>          .filter!(a => !(a.empty));
>>> }
>>
>> sorry, ignore that attempt, it's woefully broken...
>
> Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you?  Because I did and thought I was going insane until realized DPaste was broken.

oohhh so that was what that was. Anyway, here's what I have so far:

import std.range : isInputRange;

auto splitSentence(R)(R input)
    if(isInputRange!R)
{
    import std.algorithm : map, splitter, filter;
    import std.uni : toLower;
    import std.range : ElementType;

    dchar preProc(ElementType!R c)
    {
        return (c == '\n' || c == '\r') ? ' ' : c.toLower;
    }

    return input
        .map!preProc
        .splitter!(c => c == ' ')
        .filter!(a => !(a.empty));
}

I have to have the function instead of a lamda in order to get the return type correct. I guess I could cast or use std.conv.to instead.

Anyway, the big problem I've hit is that AFAICT std.algorithm makes a complete mess of unicode and i can't find a byCodeUnit range anywhere in order to make it correct.
December 30, 2013
Thomas Gann:

> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace.

Take a look at the ways I have used here:
http://forum.dlang.org/thread/zdhfpftodxnvbpwvklcv@forum.dlang.org

Bye,
bearophile
December 31, 2013
Thanks for all your replies, guys! I have done some further research in the meantime and I have found out that I am, in fact, an idiot. There is actually a standard library function that does exactly what I am trying to do! As it turns out, std.string.split():

1) It automatically discards empty tokens, so there is no need to filter it manually.
2) Its definition of whitespace includes spaces, tabs, and both types of newlines, so there is no need to convert newlines to spaces.

An equivalent, but much less verbose, form of the function becomes the following:

    input.toLower.split;

So I'm noticing a common theme here, though, which is that lots of operations on arrays is bad? Should I be using ranges pretty much everywhere, then? Or just in performance-critical areas of code where I want to avoid a lot of copying? Would that just leave plain vanilla dynamic arrays to the task of (relatively) permanent storage?
December 31, 2013
On 2013-12-30 23:51, John Colvin wrote:

> Anyway, the big problem I've hit is that AFAICT std.algorithm makes a
> complete mess of unicode and i can't find a byCodeUnit range anywhere in
> order to make it correct.

There's a "byGrapheme" and a "byCodePoint" function in std.uni. It was recently added.

-- 
/Jacob Carlborg