A better way to write this function? (style question) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » A better way to write this function? (style question)

Thread overview

A better way to write this function? (style question)
Dec 30, 2013 Thomas Gann
Dec 30, 2013 John Colvin
Dec 30, 2013 John Colvin
Dec 30, 2013 Brad Anderson
Dec 30, 2013 John Colvin
Dec 31, 2013 Jacob Carlborg
Dec 30, 2013 Brad Anderson
Dec 30, 2013 bearophile
Dec 31, 2013 Thomas Gann

December 30, 2013

A better way to write this function? (style question)

Posted by Thomas Gann

Thomas Gann

I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:

string[] split_sentence(string input)
{
    string line;

    foreach(c; input)
    {
        if(c == '\n' || c == '\r')
            line ~= ' ';

        else
            line ~= c.toLower();
    }

    return line.splitter(' ').filter!(a => a.length).array;
}

Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?

December 30, 2013

Re: A better way to write this function? (style question)

Posted by John Colvin
in reply to Thomas Gann

John Colvin

Posted in reply to Thomas Gann

On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>
> string[] split_sentence(string input)
> {
>     string line;
>
>     foreach(c; input)
>     {
>         if(c == '\n' || c == '\r')
>             line ~= ' ';
>
>         else
>             line ~= c.toLower();
>     }
>
>     return line.splitter(' ').filter!(a => a.length).array;
> }
>
> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?


A few points:

by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.

using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'

In my opinion the whole API would be better as range-based:

auto splitSentence(R)(R input)
    if(isInputRange!R)
{
    return input
           .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
           .splitter!(' ')
           .filter!(a => !(a.empty));
}

December 30, 2013

Re: A better way to write this function? (style question)

Posted by John Colvin
in reply to John Colvin

John Colvin

Posted in reply to John Colvin

On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
> On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
>> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>>
>> string[] split_sentence(string input)
>> {
>>    string line;
>>
>>    foreach(c; input)
>>    {
>>        if(c == '\n' || c == '\r')
>>            line ~= ' ';
>>
>>        else
>>            line ~= c.toLower();
>>    }
>>
>>    return line.splitter(' ').filter!(a => a.length).array;
>> }
>>
>> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
>
>
> A few points:
>
> by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.
>
> using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'
>
> In my opinion the whole API would be better as range-based:
>
> auto splitSentence(R)(R input)
>     if(isInputRange!R)
> {
>     return input
>            .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
>            .splitter!(' ')
>            .filter!(a => !(a.empty));
> }

sorry, ignore that attempt, it's woefully broken...

December 30, 2013

Re: A better way to write this function? (style question)

Posted by Brad Anderson
in reply to Thomas Gann

Brad Anderson

Posted in reply to Thomas Gann

On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>
> string[] split_sentence(string input)
> {
>     string line;
>
>     foreach(c; input)
>     {
>         if(c == '\n' || c == '\r')
>             line ~= ' ';
>
>         else
>             line ~= c.toLower();
>     }
>
>     return line.splitter(' ').filter!(a => a.length).array;
> }
>
> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?

Not a huge improvement and it could be a lot faster, no doubt:

    string[] split_sentence(string input)
    {
        import std.regex;
        auto split_re = ctRegex!(r"[\n\r ]");
        return input.split(split_re)
                    .map!toLower
                    .filter!(a => !a.empty)
                    .array();
    }

You could return the result of filter instead of an array to make it lazy and avoid an allocation if the caller is able to use it in that form. toLower had some really slow and incorrect behavior but I think that was fixed in 2.064.  It still might be ASCII only though.  I believe std.uni has a toLower that is correct for all of unicode.

December 30, 2013

Re: A better way to write this function? (style question)

Posted by Brad Anderson
in reply to John Colvin

Brad Anderson

Posted in reply to John Colvin

On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
> On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
>> On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
>>> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>>>
>>> string[] split_sentence(string input)
>>> {
>>>   string line;
>>>
>>>   foreach(c; input)
>>>   {
>>>       if(c == '\n' || c == '\r')
>>>           line ~= ' ';
>>>
>>>       else
>>>           line ~= c.toLower();
>>>   }
>>>
>>>   return line.splitter(' ').filter!(a => a.length).array;
>>> }
>>>
>>> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
>>
>>
>> A few points:
>>
>> by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.
>>
>> using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'
>>
>> In my opinion the whole API would be better as range-based:
>>
>> auto splitSentence(R)(R input)
>>    if(isInputRange!R)
>> {
>>    return input
>>           .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
>>           .splitter!(' ')
>>           .filter!(a => !(a.empty));
>> }
>
> sorry, ignore that attempt, it's woefully broken...

Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you?  Because I did and thought I was going insane until realized DPaste was broken.

December 30, 2013

Re: A better way to write this function? (style question)

Posted by John Colvin
in reply to Brad Anderson

John Colvin

Posted in reply to Brad Anderson

On Monday, 30 December 2013 at 22:38:43 UTC, Brad Anderson wrote:
> On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
>> On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
>>> On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
>>>> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question:
>>>>
>>>> string[] split_sentence(string input)
>>>> {
>>>>  string line;
>>>>
>>>>  foreach(c; input)
>>>>  {
>>>>      if(c == '\n' || c == '\r')
>>>>          line ~= ' ';
>>>>
>>>>      else
>>>>          line ~= c.toLower();
>>>>  }
>>>>
>>>>  return line.splitter(' ').filter!(a => a.length).array;
>>>> }
>>>>
>>>> Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
>>>
>>>
>>> A few points:
>>>
>>> by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning.
>>>
>>> using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line'
>>>
>>> In my opinion the whole API would be better as range-based:
>>>
>>> auto splitSentence(R)(R input)
>>>   if(isInputRange!R)
>>> {
>>>   return input
>>>          .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower)
>>>          .splitter!(' ')
>>>          .filter!(a => !(a.empty));
>>> }
>>
>> sorry, ignore that attempt, it's woefully broken...
>
> Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you?  Because I did and thought I was going insane until realized DPaste was broken.

oohhh so that was what that was. Anyway, here's what I have so far:

import std.range : isInputRange;

auto splitSentence(R)(R input)
    if(isInputRange!R)
{
    import std.algorithm : map, splitter, filter;
    import std.uni : toLower;
    import std.range : ElementType;

    dchar preProc(ElementType!R c)
    {
        return (c == '\n' || c == '\r') ? ' ' : c.toLower;
    }

    return input
        .map!preProc
        .splitter!(c => c == ' ')
        .filter!(a => !(a.empty));
}

I have to have the function instead of a lamda in order to get the return type correct. I guess I could cast or use std.conv.to instead.

Anyway, the big problem I've hit is that AFAICT std.algorithm makes a complete mess of unicode and i can't find a byCodeUnit range anywhere in order to make it correct.

December 30, 2013

Re: A better way to write this function? (style question)

Posted by bearophile
in reply to Thomas Gann

bearophile

Posted in reply to Thomas Gann

Thomas Gann:

> I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace.

Take a look at the ways I have used here:
http://forum.dlang.org/thread/zdhfpftodxnvbpwvklcv@forum.dlang.org

Bye,
bearophile

December 31, 2013

Re: A better way to write this function? (style question)

Posted by Thomas Gann
in reply to Thomas Gann

Thomas Gann

Posted in reply to Thomas Gann

Thanks for all your replies, guys! I have done some further research in the meantime and I have found out that I am, in fact, an idiot. There is actually a standard library function that does exactly what I am trying to do! As it turns out, std.string.split():

1) It automatically discards empty tokens, so there is no need to filter it manually.
2) Its definition of whitespace includes spaces, tabs, and both types of newlines, so there is no need to convert newlines to spaces.

An equivalent, but much less verbose, form of the function becomes the following:

    input.toLower.split;

So I'm noticing a common theme here, though, which is that lots of operations on arrays is bad? Should I be using ranges pretty much everywhere, then? Or just in performance-critical areas of code where I want to avoid a lot of copying? Would that just leave plain vanilla dynamic arrays to the task of (relatively) permanent storage?

December 31, 2013

Re: A better way to write this function? (style question)

Posted by Jacob Carlborg
in reply to John Colvin

Jacob Carlborg

Posted in reply to John Colvin

On 2013-12-30 23:51, John Colvin wrote:

> Anyway, the big problem I've hit is that AFAICT std.algorithm makes a
> complete mess of unicode and i can't find a byCodeUnit range anywhere in
> order to make it correct.

There's a "byGrapheme" and a "byCodePoint" function in std.uni. It was recently added.

-- 
/Jacob Carlborg

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation