Thread overview | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
April 17, 2015 Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Challenge level - Moderately easy Consider the function std.string.wrap: http://dlang.org/phobos/std_string.html#.wrap It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to: 1. Accept a ForwardRange as input. 2. Return a lazy ForwardRange that delivers the characters of the wrapped result one by one on demand. 3. Not allocate any memory. 4. The element encoding type of the returned range must be the same as the element encoding type of the input. |
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote: > Challenge level - Moderately easy > > Consider the function std.string.wrap: > > http://dlang.org/phobos/std_string.html#.wrap > > It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to: > > 1. Accept a ForwardRange as input. > 2. Return a lazy ForwardRange that delivers the characters of the > wrapped result one by one on demand. > 3. Not allocate any memory. > 4. The element encoding type of the returned range must be the same as > the element encoding type of the input. This is harder than it looks at first sight, actually. Mostly thanks to the complexity of Unicode... you need to identify zero-width, normal-width, and double-width characters, combining diacritics, various kinds of spaces (e.g. cannot break on non-breaking space) and treat them accordingly. Which requires decoding. (Well, in theory std.uni could be enhanced to work directly with encoded data, but right now it doesn't. In any case this is outside the scope of this challenge, I think.) Unfortunately, the only reliable way I know of currently that can deal with the spacing of Unicode characters correctly is to segment the input with byGrapheme, which currently is GC-dependent. So this fails (3). There's also the question of what to do with bidi markings: how do you handle counting the columns in that case? Of course, if you forego Unicode correctness, then you *could* just word-wrap on a per-character basis (i.e., every character counts as 1 column), but this also makes the resulting code useless as far as dealing with general Unicode data is concerned -- it'd only work for ASCII, and various character ranges inherited from the old 8-bit European encodings. Not to mention, line-breaking in Chinese encodings cannot work as prescribed anyway, because the rules are different (you can break anywhere at a character boundary except punctuation -- there is no such concept as a space character in Chinese writing). Same applies for Korean/Japanese. So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). T -- All problems are easy in retrospect. |
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...] > -- > All problems are easy in retrospect. Argh, my Perl script doth mock me! T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder |
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...] > So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). [...] Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate: import std.range.primitives; /** * Range version of $(D std.string.wrap). * * Bugs: * This function does not conform to the Unicode line-breaking algorithm. It * does not take into account zero-width characters, combining diacritics, * double-width characters, non-breaking spaces, and bidi markings. Strings * containing these characters therefore may not be wrapped correctly. */ auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null, R indent = null, in size_t tabsize = 8) if (isForwardRange!R && is(ElementType!R : dchar)) { import std.algorithm.iteration : map, joiner; import std.range : chain; import std.uni; alias CharType = ElementType!R; // Returns: Wrapped lines. struct Result { private R range, indent; private size_t maxCols, tabSize; private size_t currentCol = 0; private R curIndent; bool empty = true; bool atBreak = false; this(R _range, R _firstindent, R _indent, size_t columns, size_t tabsize) { this.range = _range; this.curIndent = _firstindent.save; this.indent = _indent; this.maxCols = columns; this.tabSize = tabsize; empty = _range.empty; } @property CharType front() { if (atBreak) return '\n'; // should implicit convert to wider characters else if (!curIndent.empty) return curIndent.front; else return range.front; } void popFront() { if (atBreak) { // We're at a linebreak. atBreak = false; currentCol = 0; // Start new line with indent curIndent = indent.save; return; } else if (!curIndent.empty) { // We're iterating over an initial indent. curIndent.popFront(); currentCol++; return; } // We're iterating over the main range. range.popFront(); if (range.empty) { empty = true; return; } if (range.front == '\t') currentCol += tabSize; else if (isWhite(range.front)) { // Scan for next word boundary to decide whether or not to // break here. R tmp = range.save; assert(!tmp.empty); size_t col = currentCol; // Find start of next word while (!tmp.empty && isWhite(tmp.front)) { col++; tmp.popFront(); } // Remember start of next word so that if we need to break, we // won't introduce extraneous spaces to the start of the new // line. R nextWord = tmp.save; while (!tmp.empty && !isWhite(tmp.front)) { col++; tmp.popFront(); } assert(tmp.empty || isWhite(tmp.front)); if (col > maxCols) { // Word wrap needed. Move current range position to // start of next word. atBreak = true; range = nextWord; return; } } currentCol++; } @property Result save() { Result copy = this; copy.range = this.range.save; //copy.indent = this.indent.save; // probably not needed? copy.curIndent = this.curIndent.save; return copy; } } static assert(isForwardRange!Result); return Result(range, firstindent, indent, columns, tabsize); } unittest { import std.algorithm.comparison : equal; auto s = ("This is a very long, artificially long, and gratuitously long "~ "single-line sentence to serve as a test case for byParagraph.") .wrapped(30, ">>>>", ">>"); assert(s.equal( ">>>>This is a very long,\n"~ ">>artificially long, and\n"~ ">>gratuitously long single-line\n"~ ">>sentence to serve as a test\n"~ ">>case for byParagraph." )); } I didn't bother with avoiding autodecoding -- that should be relatively easy to add, but I think it's stupid that we have to continually write workarounds in our code to get around auto-decoding. If it's so important that we don't autodecode, can we pretty please make the stupid decision already and kill it off for good?! T -- To err is human; to forgive is not our policy. -- Samuel Adler |
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
> So either you have to throw out all pretenses of Unicode-correctness and
> just stick with ASCII-style per-character line-wrapping, or you have to
> live with byGrapheme with all the complexity that it entails. The former
> is quite easy to write -- I could throw it together in a couple o' hours
> max, but the latter is a pretty big project (cf. Unicode line-breaking
> algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
|
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
> Well, talk is cheap, so here's a working implementation of the
> non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
|
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote: > On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: > >Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate: > > awesome! Please make a pull request for this so you get proper credit! Doesn't that mean I have to add the autodecoding workarounds first? T -- Life is too short to run proprietary software. -- Bdale Garbee |
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
>> On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
>>> Well, talk is cheap, so here's a working implementation of the
>>> non-Unicode-correct line wrapper that uses ranges and does not
>>> allocate:
>>
>> awesome! Please make a pull request for this so you get proper credit!
>
> Doesn't that mean I have to add the autodecoding workarounds first?
Before it gets pulled, yes, meaning that the element type of front() should match the element encoding type of Range.
There's also an issue with firstindent and indent being the same range type as 'range', which is not practical as Range is likely a voldemort type. I suggest making them simply of type 'string'. I don't see any point to make them ranges.
A unit test with an input range is needed, and one with some multibyte unicode encodings.
|
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh Attachments: | On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
> Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
|
April 17, 2015 Re: Today's programming challenge - How's your Range-Fu ? | ||||
---|---|---|---|---|
| ||||
Posted in reply to ketmar | On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
> On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
>
>> Well, talk is cheap, so here's a working implementation of the
>> non-Unicode-correct line wrapper that uses ranges and does not allocate:
>
> there is some... inconsistency: `std.string.wrap` adds final "\n" to
> string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
|
Copyright © 1999-2021 by the D Language Foundation