Jump to page: 1 27  
Page
Thread overview
Today's programming challenge - How's your Range-Fu ?
Apr 17, 2015
Walter Bright
Apr 17, 2015
H. S. Teoh
Apr 17, 2015
Walter Bright
Apr 18, 2015
John Colvin
Apr 18, 2015
Jacob Carlborg
Apr 18, 2015
Walter Bright
Apr 18, 2015
Panke
Apr 18, 2015
Walter Bright
Apr 18, 2015
Panke
Apr 18, 2015
Jacob Carlborg
Apr 18, 2015
Chris
Apr 18, 2015
Gary Willoughby
Apr 18, 2015
Jacob Carlborg
Apr 18, 2015
Jakob Ovrum
Apr 18, 2015
H. S. Teoh
Apr 18, 2015
Tobias Pankrath
Apr 18, 2015
Chris
Apr 18, 2015
Walter Bright
Apr 18, 2015
H. S. Teoh
Apr 18, 2015
Walter Bright
Apr 18, 2015
H. S. Teoh
Apr 18, 2015
Tobias Pankrath
Apr 20, 2015
Chris
Apr 20, 2015
Panke
Apr 20, 2015
Chris
Apr 20, 2015
Panke
Apr 20, 2015
John Colvin
Apr 20, 2015
H. S. Teoh
Apr 20, 2015
Panke
Apr 20, 2015
rumbu
Apr 21, 2015
JohnnyK
Apr 21, 2015
John Colvin
Apr 20, 2015
Chris
Apr 19, 2015
John Colvin
Apr 18, 2015
Walter Bright
Apr 18, 2015
H. S. Teoh
Apr 18, 2015
Walter Bright
Apr 18, 2015
H. S. Teoh
Apr 18, 2015
Walter Bright
Apr 19, 2015
Shachar Shemesh
Apr 19, 2015
Abdulhaq
Apr 19, 2015
Shachar Shemesh
Apr 19, 2015
John Colvin
Apr 19, 2015
ketmar
Apr 19, 2015
weaselcat
Apr 20, 2015
Nick B
Apr 20, 2015
ketmar
Apr 20, 2015
Nick B
Apr 20, 2015
Jacob Carlborg
Apr 20, 2015
Shachar Shemesh
Apr 18, 2015
Paulo Pinto
Apr 18, 2015
Tobias Pankrath
Apr 18, 2015
Shachar Shemesh
Apr 17, 2015
H. S. Teoh
Apr 17, 2015
H. S. Teoh
Apr 17, 2015
Walter Bright
Apr 17, 2015
H. S. Teoh
Apr 17, 2015
Walter Bright
Apr 17, 2015
ketmar
Apr 17, 2015
Panke
Apr 18, 2015
H. S. Teoh
Apr 18, 2015
Walter Bright
April 17, 2015
Challenge level - Moderately easy

Consider the function std.string.wrap:

  http://dlang.org/phobos/std_string.html#.wrap

It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to:

1. Accept a ForwardRange as input.
2. Return a lazy ForwardRange that delivers the characters of the wrapped result one by one on demand.
3. Not allocate any memory.
4. The element encoding type of the returned range must be the same as the element encoding type of the input.

April 17, 2015
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:
> Challenge level - Moderately easy
>
> Consider the function std.string.wrap:
> 
>   http://dlang.org/phobos/std_string.html#.wrap
> 
> It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to:
> 
> 1. Accept a ForwardRange as input.
> 2. Return a lazy ForwardRange that delivers the characters of the
> wrapped result one by one on demand.
> 3. Not allocate any memory.
> 4. The element encoding type of the returned range must be the same as
> the element encoding type of the input.

This is harder than it looks at first sight, actually. Mostly thanks to the complexity of Unicode... you need to identify zero-width, normal-width, and double-width characters, combining diacritics, various kinds of spaces (e.g. cannot break on non-breaking space) and treat them accordingly.  Which requires decoding.  (Well, in theory std.uni could be enhanced to work directly with encoded data, but right now it doesn't. In any case this is outside the scope of this challenge, I think.)

Unfortunately, the only reliable way I know of currently that can deal with the spacing of Unicode characters correctly is to segment the input with byGrapheme, which currently is GC-dependent. So this fails (3).

There's also the question of what to do with bidi markings: how do you handle counting the columns in that case?

Of course, if you forego Unicode correctness, then you *could* just word-wrap on a per-character basis (i.e., every character counts as 1 column), but this also makes the resulting code useless as far as dealing with general Unicode data is concerned -- it'd only work for ASCII, and various character ranges inherited from the old 8-bit European encodings. Not to mention, line-breaking in Chinese encodings cannot work as prescribed anyway, because the rules are different (you can break anywhere at a character boundary except punctuation -- there is no such concept as a space character in Chinese writing). Same applies for Korean/Japanese.

So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).


T

-- 
All problems are easy in retrospect.
April 17, 2015
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...]
> -- 
> All problems are easy in retrospect.

Argh, my Perl script doth mock me!


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
April 17, 2015
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...]
> So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).
[...]

Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:

	import std.range.primitives;

	/**
	 * Range version of $(D std.string.wrap).
	 *
	 * Bugs:
	 * This function does not conform to the Unicode line-breaking algorithm. It
	 * does not take into account zero-width characters, combining diacritics,
	 * double-width characters, non-breaking spaces, and bidi markings.  Strings
	 * containing these characters therefore may not be wrapped correctly.
	 */
	auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null,
	                R indent = null, in size_t tabsize = 8)
	    if (isForwardRange!R && is(ElementType!R : dchar))
	{
	    import std.algorithm.iteration : map, joiner;
	    import std.range : chain;
	    import std.uni;

	    alias CharType = ElementType!R;

	    // Returns: Wrapped lines.
	    struct Result
	    {
	        private R range, indent;
	        private size_t maxCols, tabSize;

	        private size_t currentCol = 0;
	        private R curIndent;
	        bool empty = true;
	        bool atBreak = false;

	        this(R _range, R _firstindent, R _indent, size_t columns, size_t tabsize)
	        {
	            this.range = _range;
	            this.curIndent = _firstindent.save;
	            this.indent = _indent;
	            this.maxCols = columns;
	            this.tabSize = tabsize;

	            empty = _range.empty;

	        }

	        @property CharType front()
	        {
	            if (atBreak)
	                return '\n';    // should implicit convert to wider characters
	            else if (!curIndent.empty)
	                return curIndent.front;
	            else
	                return range.front;
	        }

	        void popFront()
	        {
	            if (atBreak)
	            {
	                // We're at a linebreak.
	                atBreak = false;
	                currentCol = 0;

	                // Start new line with indent
	                curIndent = indent.save;
	                return;
	            }
	            else if (!curIndent.empty)
	            {
	                // We're iterating over an initial indent.
	                curIndent.popFront();
	                currentCol++;
	                return;
	            }

	            // We're iterating over the main range.
	            range.popFront();
	            if (range.empty)
	            {
	                empty = true;
	                return;
	            }

	            if (range.front == '\t')
	                currentCol += tabSize;
	            else if (isWhite(range.front))
	            {
	                // Scan for next word boundary to decide whether or not to
	                // break here.
	                R tmp = range.save;
	                assert(!tmp.empty);

	                size_t col = currentCol;

	                // Find start of next word
	                while (!tmp.empty && isWhite(tmp.front))
	                {
	                    col++;
	                    tmp.popFront();
	                }

	                // Remember start of next word so that if we need to break, we
	                // won't introduce extraneous spaces to the start of the new
	                // line.
	                R nextWord = tmp.save;

	                while (!tmp.empty && !isWhite(tmp.front))
	                {
	                    col++;
	                    tmp.popFront();
	                }
	                assert(tmp.empty || isWhite(tmp.front));

	                if (col > maxCols)
	                {
	                    // Word wrap needed. Move current range position to
	                    // start of next word.
	                    atBreak = true;
	                    range = nextWord;
	                    return;
	                }
	            }
	            currentCol++;
	        }

	        @property Result save()
	        {
	            Result copy = this;
	            copy.range = this.range.save;
	            //copy.indent = this.indent.save; // probably not needed?
	            copy.curIndent = this.curIndent.save;
	            return copy;
	        }
	    }
	    static assert(isForwardRange!Result);

	    return Result(range, firstindent, indent, columns, tabsize);
	}

	unittest
	{
	    import std.algorithm.comparison : equal;

	    auto s = ("This is a very long, artificially long, and gratuitously long "~
	              "single-line sentence to serve as a test case for byParagraph.")
	             .wrapped(30, ">>>>", ">>");
	    assert(s.equal(
	        ">>>>This is a very long,\n"~
	        ">>artificially long, and\n"~
	        ">>gratuitously long single-line\n"~
	        ">>sentence to serve as a test\n"~
	        ">>case for byParagraph."
	    ));
	}


I didn't bother with avoiding autodecoding -- that should be relatively easy to add, but I think it's stupid that we have to continually write workarounds in our code to get around auto-decoding. If it's so important that we don't autodecode, can we pretty please make the stupid decision already and kill it off for good?!


T

-- 
To err is human; to forgive is not our policy. -- Samuel Adler
April 17, 2015
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
> So either you have to throw out all pretenses of Unicode-correctness and
> just stick with ASCII-style per-character line-wrapping, or you have to
> live with byGrapheme with all the complexity that it entails. The former
> is quite easy to write -- I could throw it together in a couple o' hours
> max, but the latter is a pretty big project (cf. Unicode line-breaking
> algorithm, which is one of the TR's).

It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.

April 17, 2015
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
> Well, talk is cheap, so here's a working implementation of the
> non-Unicode-correct line wrapper that uses ranges and does not allocate:

awesome! Please make a pull request for this so you get proper credit!

April 17, 2015
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
> >Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
> 
> awesome! Please make a pull request for this so you get proper credit!

Doesn't that mean I have to add the autodecoding workarounds first?


T

-- 
Life is too short to run proprietary software. -- Bdale Garbee
April 17, 2015
On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
>> On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
>>> Well, talk is cheap, so here's a working implementation of the
>>> non-Unicode-correct line wrapper that uses ranges and does not
>>> allocate:
>>
>> awesome! Please make a pull request for this so you get proper credit!
>
> Doesn't that mean I have to add the autodecoding workarounds first?

Before it gets pulled, yes, meaning that the element type of front() should match the element encoding type of Range.

There's also an issue with firstindent and indent being the same range type as 'range', which is not practical as Range is likely a voldemort type. I suggest making them simply of type 'string'. I don't see any point to make them ranges.

A unit test with an input range is needed, and one with some multibyte unicode encodings.

April 17, 2015
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:

> Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:

there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.

April 17, 2015
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
> On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
>
>> Well, talk is cheap, so here's a working implementation of the
>> non-Unicode-correct line wrapper that uses ranges and does not allocate:
>
> there is some... inconsistency: `std.string.wrap` adds final "\n" to
> string. ;-) but i always hated it for that.

A range of lines instead of inserted \n would be a good API as well.
« First   ‹ Prev
1 2 3 4 5 6 7