April 18, 2015
On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote:
> There's also the question of what to do with bidi markings: how do you
> handle counting the columns in that case?
>

Which BiDi markings are you referring to? LRM/RLM and friends? If so, don't worry: the interface, as described, is incapable of properly handling BiDi anyway.

The proper way to handle BiDi line wrapping is this. First you assign a BiDi level to each character (at which point the markings are, effectively, removed from the input, so there goes your problem). Then you accumulate the glyphs' widths until the line limit is reached, and finally you reorder each line according to the BiDi levels you calculated earlier.
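In D, the multi-pass structure described above might be sketched like this toy example. The names isRtl and bidiWrap are hypothetical: real level assignment follows UAX #9 (here only the Hebrew block is treated as RTL), and every character is naively assumed to be one column wide:

```d
import std.algorithm.mutation : reverse;

// Hypothetical stand-in for real UAX #9 level resolution: only the
// Hebrew block is treated as RTL; markings and neutrals are ignored.
bool isRtl(dchar c)
{
    return c >= 0x0590 && c <= 0x05FF;
}

// Toy wrapper: every character is assumed to occupy one column.
dstring[] bidiWrap(dstring text, size_t width)
{
    // Pass 1: assign an embedding level per character (0 = LTR, 1 = RTL).
    auto levels = new ubyte[text.length];
    foreach (i, c; text)
        levels[i] = isRtl(c) ? 1 : 0;

    dstring[] lines;
    // Pass 2: break into lines once the column limit is reached.
    for (size_t start = 0; start < text.length; start += width)
    {
        auto end = start + width < text.length ? start + width : text.length;
        auto line = text[start .. end].dup;

        // Pass 3: reorder each line -- reverse every run of odd (RTL) levels.
        size_t i = 0;
        while (i < line.length)
        {
            if (levels[start + i] == 1)
            {
                auto j = i;
                while (j < line.length && levels[start + j] == 1)
                    ++j;
                reverse(line[i .. j]);
                i = j;
            }
            else
                ++i;
        }
        lines ~= line.idup;
    }
    return lines;
}

void main()
{
    assert(bidiWrap("abcd"d, 2) == ["ab"d, "cd"d]);
    // A Hebrew run that fits on one line comes out reversed for display.
    assert(bidiWrap("\u05D0\u05D1\u05D2"d, 3) == ["\u05D2\u05D1\u05D0"d]);
}
```

Note that the per-character levels computed in pass 1 must survive until pass 3, after the line breaks are known; that is exactly the multi-pass requirement a single-pass interface cannot satisfy.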

As can be easily seen, this requires carrying per-paragraph BiDi information across the line-break logic, pretty much mandating multiple passes over the input. Since the requested interface does not allow that, proper BiDi line breaking is impossible with it.

I'll mention that not everyone takes this as a serious problem. Windows' text control, for example, calculates line breaks on the text, and then runs the BiDi algorithm on each line individually. Few people notice this. Then again, people have already grown used to BiDi text being scrambled.

Shachar
April 18, 2015
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
> On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
>> So either you have to throw out all pretenses of Unicode-correctness and
>> just stick with ASCII-style per-character line-wrapping, or you have to
>> live with byGrapheme with all the complexity that it entails. The former
>> is quite easy to write -- I could throw it together in a couple o' hours
>> max, but the latter is a pretty big project (cf. Unicode line-breaking
>> algorithm, which is one of the TR's).
>
> It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.

Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning.

Code points are a useful chunk size for some tasks and completely insufficient for others.
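A quick way to see the difference in D is to count the same string by code points and by graphemes with std.uni.byGrapheme:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "café" written with a combining acute accent: 5 code points,
    // but only 4 user-perceived characters (graphemes).
    string s = "cafe\u0301";
    assert(s.walkLength == 5);            // counts decoded code points
    assert(s.byGrapheme.walkLength == 4); // counts graphemes
}
```

A wrapper working by code points would budget two columns for the final "é" here, even though it renders as a single glyph.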
April 18, 2015
On 2015-04-18 09:58, John Colvin wrote:

> Code points aren't equivalent to characters. They're not the same thing
> in most European languages, never mind the rest of the world. If we have
> a line-wrapping algorithm in phobos that works by code points, it needs
> a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning.

For that we have std.ascii.

-- 
/Jacob Carlborg
April 18, 2015
On 4/18/2015 12:58 AM, John Colvin wrote:
> On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
>> On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
>>> So either you have to throw out all pretenses of Unicode-correctness and
>>> just stick with ASCII-style per-character line-wrapping, or you have to
>>> live with byGrapheme with all the complexity that it entails. The former
>>> is quite easy to write -- I could throw it together in a couple o' hours
>>> max, but the latter is a pretty big project (cf. Unicode line-breaking
>>> algorithm, which is one of the TR's).
>>
>> It'd be good enough to duplicate the existing behavior, which is to treat
>> decoded unicode characters as one column.
>
> Code points aren't equivalent to characters. They're not the same thing in most
> European languages,

I know a bit of German; for which characters is that not true?

> never mind the rest of the world. If we have a line-wrapping
> algorithm in phobos that works by code points, it needs a large "THIS IS ONLY
> FOR SIMPLE ENGLISH TEXT" warning.
>
> Code points are a useful chunk size for some tasks and completely insufficient
> for others.

The first order of business is making wrap() work with ranges, and otherwise work the same as it always has (it's one of the oldest Phobos functions).

There are different standard levels of Unicode support. The lowest level is working correctly with code points, which is what wrap() does. Going to a higher level of support comes after range support.
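A minimal sketch of that lowest, code-point-level behaviour, using std.string.wrap as it exists today (the sample text is plain ASCII, where code points and columns happen to coincide):

```d
import std.string : wrap;
import std.algorithm.iteration : splitter;

void main()
{
    // Greedy word wrap at a 10-column limit.
    auto wrapped = wrap("the quick brown fox jumps over the lazy dog", 10);
    // Every produced line fits within the limit (each word here is short
    // enough to fit on a line by itself).
    foreach (line; wrapped.splitter('\n'))
        assert(line.length <= 10);
}
```

With combining characters in the input, the same column accounting would charge one column per code point, which is exactly the limitation under discussion.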

I know little about combining characters. You obviously know much more, do you want to take charge of this function?
April 18, 2015
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
> On 4/18/2015 12:58 AM, John Colvin wrote:
>> On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
>>> On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
>>>> So either you have to throw out all pretenses of Unicode-correctness and
>>>> just stick with ASCII-style per-character line-wrapping, or you have to
>>>> live with byGrapheme with all the complexity that it entails. The former
>>>> is quite easy to write -- I could throw it together in a couple o' hours
>>>> max, but the latter is a pretty big project (cf. Unicode line-breaking
>>>> algorithm, which is one of the TR's).
>>>
>>> It'd be good enough to duplicate the existing behavior, which is to treat
>>> decoded unicode characters as one column.
>>
>> Code points aren't equivalent to characters. They're not the same thing in most
>> European languages,
>
> I know a bit of German, for what characters is that not true?

Umlauts, if combining characters are used. Also words that keep their accents after being imported from foreign languages, e.g. Café.

Getting all of Unicode correct seems a daunting task with a severe performance impact, especially if we need to assume that a string might be in any normalization form, or in none at all.

See also: http://unicode.org/reports/tr15/#Norm_Forms
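For reference, std.uni already exposes the normalization forms, so the composed and decomposed spellings can at least be made comparable by normalizing first (a small sketch, assuming a reasonably recent Phobos):

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    string composed   = "\u00E9";   // é as a single precomposed code point
    string decomposed = "e\u0301";  // e followed by a combining acute accent

    // Bitwise comparison sees two different strings...
    assert(composed != decomposed);

    // ...but either normalization form makes them agree.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
}
```

The performance worry stands, though: normalizing on every comparison is exactly the kind of hidden cost that is hard to justify in a low-level string routine.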
April 18, 2015
On 4/18/2015 1:26 AM, Panke wrote:
> On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
>> On 4/18/2015 12:58 AM, John Colvin wrote:
>>> On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
>>>> On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
>>>>> So either you have to throw out all pretenses of Unicode-correctness and
>>>>> just stick with ASCII-style per-character line-wrapping, or you have to
>>>>> live with byGrapheme with all the complexity that it entails. The former
>>>>> is quite easy to write -- I could throw it together in a couple o' hours
>>>>> max, but the latter is a pretty big project (cf. Unicode line-breaking
>>>>> algorithm, which is one of the TR's).
>>>>
>>>> It'd be good enough to duplicate the existing behavior, which is to treat
>>>> decoded unicode characters as one column.
>>>
>>> Code points aren't equivalent to characters. They're not the same thing in most
>>> European languages,
>>
>> I know a bit of German, for what characters is that not true?
>
> Umlauts, if combined characters are used. Also words that still have their
> accents left after import from foreign languages. E.g. Café

That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.

April 18, 2015
> That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.

Yes, but you may have perfectly fine Unicode text where the decomposed (combining-character) form is used. In fact, there is a normalization form (NFD) that requires it. To be fully correct, Phobos needs to handle that as well.

April 18, 2015
On 2015-04-18 12:27, Walter Bright wrote:

> That doesn't make sense to me, because the umlauts and the accented e
> all have Unicode code point assignments.

This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";
    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);
    writeln(a, " ", b);
}

If you run the above code, all asserts pass. If your system correctly supports Unicode (it works on OS X 10.10), the two printed characters should look exactly the same.

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm

-- 
/Jacob Carlborg
April 18, 2015
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
> On 2015-04-18 12:27, Walter Bright wrote:
>
>> That doesn't make sense to me, because the umlauts and the accented e
>> all have Unicode code point assignments.
>
> This code snippet demonstrates the problem:
>
> import std.stdio;
>
> void main ()
> {
>     dstring a = "e\u0301";
>     dstring b = "é";
>     assert(a != b);
>     assert(a.length == 2);
>     assert(b.length == 1);
>     writeln(a, " ", b);
> }
>
> If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.
>
> \u0301 is the "combining acute accent" [1].
>
> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm

Yep, this was the cause of some bugs I had in my program. The thing is, you never know whether a text is composed or decomposed, so you have to be prepared for "é" to have length 2 or 1. On OS X these characters are automatically decomposed by default, so if you pipe text through the system, an "é" (length 1) automatically becomes "e\u0301" (length 2). The same goes for file names on OS X. I've had to find a workaround for this more than once.
April 18, 2015
On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:
> On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
>> On 4/18/2015 12:58 AM, John Colvin wrote:
>>> On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
>>>> On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
>>>>> So either you have to throw out all pretenses of Unicode-correctness and
>>>>> just stick with ASCII-style per-character line-wrapping, or you have to
>>>>> live with byGrapheme with all the complexity that it entails. The former
>>>>> is quite easy to write -- I could throw it together in a couple o' hours
>>>>> max, but the latter is a pretty big project (cf. Unicode line-breaking
>>>>> algorithm, which is one of the TR's).
>>>>
>>>> It'd be good enough to duplicate the existing behavior, which is to treat
>>>> decoded unicode characters as one column.
>>>
>>> Code points aren't equivalent to characters. They're not the same thing in most
>>> European languages,
>>
>> I know a bit of German, for what characters is that not true?
>
> Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café
>
> Getting all unicode correct seems a daunting task with a severe performance impact, esp. if we need to assume that a string might have any normalization form or none at all.
>
> See also: http://unicode.org/reports/tr15/#Norm_Forms

Another issue is that lower-case and upper-case letters might have different size requirements, or look different depending on where in the word they are located.

For example, German ß and SS, Greek σ and ς. I know Turkish also has similar cases.
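These cases show up directly in std.uni's case-mapping functions (a small sketch, assuming a Phobos recent enough to do full, length-changing case mapping):

```d
import std.uni : toUpper, toLower;

void main()
{
    // One-to-many mapping: upper-casing can change the length of a string.
    assert("ß".toUpper == "SS");

    // Both sigma forms upper-case to the same letter, so the word-final
    // form cannot be recovered on the way back down.
    assert("σ".toUpper == "Σ");
    assert("ς".toUpper == "Σ");
    assert("Σ".toLower == "σ"); // not context-aware: never produces ς
}
```

The Turkish dotted/dotless i is worse still, because the correct mapping depends on the locale, not just the string.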

--
Paulo