April 18, 2015
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
> One possible solution would be to modify std.uni.graphemeStride to not
> allocate, since it shouldn't need to do so just to compute the length of
> the next grapheme.

That should be done. There should be a fixed maximum code point count for graphemeStride.

April 18, 2015
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
> >\u0301 is the "combining acute accent" [1].
> >
> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> 
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two-code-point version.

Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data. When we don't know the provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right.

The two-code-point version may also arise from string concatenation, in which case normalization has to be done again (or possibly only from the point of concatenation onward, given the right algorithms).
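For what it's worth, the two spellings and the effect of normalization are easy to demonstrate with std.uni's normalize (a minimal sketch, not code from the thread):

```d
import std.uni : NFC, normalize;

void main()
{
    string precomposed = "caf\u00E9";   // é as the single code point U+00E9
    string decomposed  = "cafe\u0301";  // 'e' followed by U+0301, e.g. after concatenation

    // The two spellings differ at the code-unit level...
    assert(precomposed != decomposed);

    // ...but NFC normalization composes e + U+0301 back into U+00E9,
    // so both collapse to the same canonical form.
    assert(normalize!NFC(decomposed) == precomposed);
}
```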


T

-- 
Mediocrity has been pushed to extremes.
April 18, 2015
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
> >One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.
> 
> That should be done. There should be a fixed maximum codepoint count to graphemeStride.

Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?
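For reference, graphemeStride only reports how many code units the next grapheme spans, so a scan is pure reading at the API level (a small sketch; whether the current Phobos implementation allocates internally is exactly the point under discussion):

```d
import std.uni : graphemeStride;

void main()
{
    string s = "e\u0301x";  // "é" spelled as e + combining acute, then "x"

    // graphemeStride returns the number of code units (here, UTF-8 bytes)
    // spanned by the grapheme starting at the given index.
    size_t n = graphemeStride(s, 0);
    assert(n == 3);           // 'e' is 1 byte, U+0301 is 2 bytes in UTF-8
    assert(s[n .. $] == "x"); // the next grapheme starts right after
}
```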


T

-- 
"How are you doing?" "Doing what?"
April 18, 2015
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
>> On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
>>> One possible solution would be to modify std.uni.graphemeStride to
>>> not allocate, since it shouldn't need to do so just to compute the
>>> length of the next grapheme.
>>
>> That should be done. There should be a fixed maximum codepoint count
>> to graphemeStride.
>
> Why? Scanning a string for a grapheme of arbitrary length does not need
> allocation since you're just reading data. Unless there is some required
> intermediate representation that I'm not aware of?

If there's no need for allocation at all, why does it allocate? This should be fixed.

April 18, 2015
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
>> On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
>>> \u0301 is the "combining acute accent" [1].
>>>
>>> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>>
>> I won't deny what the spec says, but it doesn't make any sense to have
>> two different representations of eacute, and I don't know why anyone
>> would use the two code point version.
>
> Well, *somebody* has to convert it to the single code point eacute,
> whether it's the human (if the keyboard has a single key for it), or the
> code interpreting keystrokes (the user may have typed it as e +
> combining acute), or the program that generated the combination, or the
> program that receives the data.

Data entry should be handled by the driver program, not by a universal interchange format.


> When we don't know provenance of
> incoming data, we have to assume the worst and run normalization to be
> sure that we got it right.

I'm not arguing against the existence of the Unicode standard; I'm saying I can't see any justification for standardizing different encodings of the same thing.

April 18, 2015
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
> >On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
> >>On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
> >>>One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.
> >>
> >>That should be done. There should be a fixed maximum codepoint count to graphemeStride.
> >
> >Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?
> 
> If there's no need for allocation at all, why does it allocate? This should be fixed.

AFAICT, the only reason it allocates is that it shares the same underlying implementation as byGrapheme. There's probably a way to fix this; I just don't have the time right now to figure out the code.


T

-- 
Little kids - little troubles. (Russian proverb)
April 18, 2015
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
[...]
> >When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right.
> 
> I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.

Take it up with the Unicode consortium. :-)


T

-- 
Tech-savvy: euphemism for nerdy.
April 18, 2015
On Fri, Apr 17, 2015 at 08:44:51PM +0000, Panke via Digitalmars-d wrote:
> On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
> >On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
> >
> >>Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
> >
> >there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
> 
> A range of lines instead of inserted \n would be a good API as well.

Indeed, that would be even more useful; then you could just do .joiner("\n") to get the original functionality.

However, I think Walter's goal here is to match the original wrap()
functionality.

Perhaps the prospective wrapped() function could be implemented in terms
of a byWrappedLines() function which does return a range of wrapped
lines.
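To sketch the idea (byWrappedLines is the hypothetical function proposed above, so the range of lines is hard-coded here; joiner is real std.algorithm):

```d
import std.algorithm.iteration : joiner;
import std.conv : to;

void main()
{
    // Stand-in for the output of the proposed byWrappedLines():
    // any range of already-wrapped lines will do.
    auto lines = ["The quick brown", "fox jumps over", "the lazy dog."];

    // joiner("\n") lazily recovers the single-string form; std.string.wrap
    // also appends a trailing '\n', so add one to match its output exactly.
    string wrapped = lines.joiner("\n").to!string ~ "\n";
    assert(wrapped == "The quick brown\nfox jumps over\nthe lazy dog.\n");
}
```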


T

-- 
The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
April 18, 2015
On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote:
> Take it up with the Unicode consortium. :-)

I see nobody knows :-)

April 18, 2015
On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote:
> However, I think Walter's goal here is to match the original wrap()
> functionality.

Yes, although the overarching goal is:

    Minimize Need For Using GC In Phobos

and the method here is to use ranges rather than having to allocate string temporaries.
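As a small illustration of the range-based method (a generic sketch using std.algorithm's splitter, not the actual wrap code):

```d
import std.algorithm.iteration : splitter;

void main()
{
    string text = "one two three";

    // splitter is lazy: it yields slices of the original buffer as it is
    // iterated, so no temporary strings are allocated on the GC heap.
    size_t count;
    foreach (word; text.splitter(' '))
        ++count;
    assert(count == 3);
}
```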