January 15, 2011
On 2011-01-14 18:02:32 -0500, foobar <foo@bar.com> said:

> Combining marks do need to be supported.
> Some languages use combining marks extensively (see my other post), and of course fonts for those languages exist and do support this. Mac doesn't support all languages, so I'm unsure if it's the best example out there.
> here's an example of the Hebrew bible:
> http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm
> 
> Just look at any of the PDFs there to see what Hebrew looks like with all sorts of different marks.

That's a good example. Although my attempt to extract the text from the PDF wasn't perfect, I can confirm that the marks I got in the copy-pasted text are indeed combining code points, not pre-combined ones.

This character, for instance, has a combining mark: "יָ"; it can't be represented by a pre-combined code point because no pre-combined form exists for it (or at least I couldn't find one). Some Hebrew characters have a pre-combined form for the middle dot and some other marks, presumably the most common ones, but that repertoire was clearly insufficient for this text.
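To illustrate (a quick check; this assumes the literal below survived copy-paste intact as a yod followed by a combining qamats):

	import std.stdio : writefln;

	foreach (dchar c; "יָ")       // iterate by code point
	    writefln("U+%04X", c);
	// prints U+05D9 (HEBREW LETTER YOD), then U+05B8 (HEBREW POINT QAMATS)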


> In the same vein, I could have found a Japanese text with ruby (where a Kanji character has Hiragana text on top of it that tells you how to read it)

Are you sure those are combining code points? I thought ruby was a layout feature, not something that's part of Unicode. And I can't find combining code points that would match those.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 15, 2011
Michel Fortin Wrote:

> > In the same vein, I could have found a Japanese text with ruby (where a Kanji character has Hiragana text on top of it that tells you how to read it)
> 
> Are you sure those are combining code points? I thought ruby was a layout feature, not something that's part of Unicode. And I can't find combining code points that would match those.
> 

I've looked into this and I was wrong. Ruby is a layout feature as you said. Sorry for the confusion.


January 15, 2011
Nick Sabalausky wrote:

> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:ignon1$2p4k$1@digitalmars.com...
>>
>> This may sometimes not be what the user expected; most of the time they'd care about the code points.
>>
> 
> I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else.

Going back to the topic, it seems to me that a Unicode string is a surprisingly complicated data structure that can be viewed through multiple types of ranges. In light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know.

The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings.
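A small example of that subtle wrongness (one spelling pre-combined, the other decomposed; both render identically):

	import std.algorithm : canFind;

	// both spell "exposé": one with pre-combined U+00E9, one with e + U+0301
	assert( "exposé".canFind('é'));        // found at the code point level
	assert(!"expose\u0301".canFind('é'));  // missed: no single code point matches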

Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
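Something with roughly this shape, maybe (a sketch only; graphemeStride is assumed to give the code-unit length of the grapheme starting at a given index, and I've left out the bidirectional half):

	import std.uni : graphemeStride; // assumed: length of one grapheme in code units

	// sketch: a range of graphemes exposed as string slices
	struct ByGraphemeSlice
	{
	    string s;
	    @property bool empty() { return s.length == 0; }
	    @property string front() { return s[0 .. graphemeStride(s, 0)]; }
	    void popFront() { s = s[graphemeStride(s, 0) .. $]; }
	}

Then foreach (g; ByGraphemeSlice("exposé")) yields one string slice per grapheme, whether the "é" is pre-combined or decomposed.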


January 15, 2011
On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn@gmail.com> said:

> Nick Sabalausky wrote:
> 
>> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message
>> news:ignon1$2p4k$1@digitalmars.com...
>>> 
>>> This may sometimes not be what the user expected; most of the time they'd
>>> care about the code points.
>>> 
>> 
>> I dunno, spir has successfully convinced me that most of the time it's
>> graphemes the user cares about, not code points. Using code points is just
>> as misleading as using UTF-16 code units.
> 
> I agree. This is a very informative thread, thanks spir and everybody else.
> 
> Going back to the topic, it seems to me that a Unicode string is a
> surprisingly complicated data structure that can be viewed through multiple
> types of ranges. In light of this thread, a dchar doesn't seem like such
> a useful type anymore; it is still a low-level abstraction for the purpose
> of correctly dealing with text. Perhaps even less useful, since it gives the
> illusion of correctness for those who are not in the know.
> 
> The algorithms in std.string can be upgraded to work correctly with all the
> issues mentioned, but the generic ones in std.algorithm will just subtly do
> the wrong thing when presented with dchar ranges. And, as I understood it,
> the purpose of a VleRange was exactly to make generic algorithms just work
> (tm) for strings.
> 
> Is it still possible to solve this problem or are we stuck with specialized
> string algorithms? Would it work if VleRange of string was a bidirectional
> range with string slices of graphemes as the ElementType and indexing with
> code units? Often used string algorithms could be specialized for
> performance, but if not, generic algorithms would still work.

Here's my idea.

I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

The second component would be to make the equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D's Unicode support much more intuitive.
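Internally, that could be as simple as this sketch (using a normalize function, assumed to return the NFC form of its argument; a real implementation would skip the work when both sides are already normalized):

	import std.uni : normalize; // assumed: eager NFC normalization

	bool stringEquals(string a, string b)
	{
	    // byte-wise comparison is correct once both sides share one
	    // normalization form
	    return normalize(a) == normalize(b);
	}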

This implies some semantic changes, mainly that everywhere you write a "character" you must use double quotes (string "a") instead of single quotes (code point 'a'), but from the user's point of view that's pretty much all there is to change.

There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

	foreach (grapheme; "exposé")
		if (grapheme == "é")
			break;

In this example, even if one of the two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates over full graphemes and == compares using normalization.

The important thing to keep in mind here is that the grapheme-splitting algorithm should be optimized for the case where there is no combining character, and the comparison algorithm for the case where the string is already normalized, since most strings will exhibit these characteristics.
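The fast path could look something like this sketch (graphemeStride is assumed to implement the full segmentation rules; CR is excluded because CR LF counts as a single grapheme):

	import std.uni : graphemeStride; // assumed: full segmentation rules

	// sketch: an ASCII character not followed by a combining mark is always a
	// one-byte grapheme (all combining marks are >= U+0300)
	size_t nextGraphemeLength(string s)
	{
	    if (s.length == 0) return 0;
	    if (s[0] < 0x80 && s[0] != '\r'
	            && (s.length == 1 || s[1] < 0x80))
	        return 1;                    // fast path: plain ASCII
	    return graphemeStride(s, 0);     // general (slower) path
	}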

As for ASCII, we could make it easier to use ubyte[] for it by making string literals implicitly convert to ubyte[] if all their characters are in the ASCII range.
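To illustrate the rule (hypothetical code, since this implicit conversion doesn't exist today):

	ubyte[] a = "hello";  // would compile: every character is in ASCII range
	ubyte[] b = "exposé"; // would be rejected: 'é' is outside ASCII range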

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 15, 2011
Michel Fortin wrote:

> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn@gmail.com> said:
...
>> 
>> Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
> 
> Here's my idea.
> 
> I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.
> 
...

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

January 15, 2011
Lutger Blijdestijn Wrote:

> Michel Fortin wrote:
> 
> > On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn@gmail.com> said:
> ...
> >> 
> >> Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
> > 
> > Here's my idea.
> > 
> > I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.
> > 
> ...
> 
> Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!
> 

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels, e.g. text.codeUnits to iterate by code units.
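A rough sketch of the shape I have in mind (names are illustrative; the byCodeUnit/byUTF/byGrapheme helpers are assumed to exist):

	import std.uni : byGrapheme;
	import std.utf : byCodeUnit, byUTF;

	// sketch of a 'universal text' container exposing each abstraction level
	struct Text
	{
	    private string data;                          // internal storage (UTF-8 here)
	    auto codeUnits()  { return data.byCodeUnit;  } // range of char
	    auto codePoints() { return data.byUTF!dchar; } // range of dchar
	    auto graphemes()  { return data.byGrapheme;  } // range of graphemes
	}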

Here's a (perhaps contrived) example:
Let's say I want to find the combining marks in some text.

For instance, Hebrew uses combining marks for vowels (among other things), and they are optional in the language (there's a "full" form with vowels and a "missing" form without them).
I have a Hebrew text in the "full" form, and I want to strip the vowels to convert it to the "missing" form.

How would I accomplish this with your design?

January 15, 2011
On 2011-01-15 09:09:17 -0500, foobar <foo@bar.com> said:

> Lutger Blijdestijn Wrote:
> 
>> Michel Fortin wrote:
>> 
>>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
>>> <lutger.blijdestijn@gmail.com> said:
>> ...
>>>> 
>>>> Is it still possible to solve this problem or are we stuck with
>>>> specialized string algorithms? Would it work if VleRange of string was a
>>>> bidirectional range with string slices of graphemes as the ElementType
>>>> and indexing with code units? Often used string algorithms could be
>>>> specialized for performance, but if not, generic algorithms would still
>>>> work.
>>> 
>>> Here's my idea.
>>> 
>>> I think it'd be a good idea to improve upon Andrei's first idea --
>>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
>>> elements -- by changing the element type to be the same as the string.
>>> For instance, iterating on a char[] would give you slices of char[],
>>> each having one grapheme.
>>> 
>> ...
>> 
>> Yes, this is exactly what I meant, but you are much clearer. I hope this can
>> be made to work!
>> 
> 
> My two cents are against this kind of design. The "correct" approach IMO
> is a 'universal text' type which is a _container_ of said text. This type
> would provide ranges for the various abstraction levels, e.g.
> text.codeUnits to iterate by code units.

Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar(), which would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you deal with whatever kind of code unit or code point you want. This would be mostly symmetric to what you can already do with foreach:

	foreach (char c; "hello") {}
	foreach (wchar c; "hello") {}
	foreach (dchar c; "hello") {}
// same as:
	foreach (c; "hello".by!char()) {}
	foreach (c; "hello".by!wchar()) {}
	foreach (c; "hello".by!dchar()) {}


> Here's a (perhaps contrived) example:
> Let's say I want to find the combining marks in some text.
> 
> For instance, Hebrew uses combining marks for vowels (among other things), and they are optional in the language (there's a "full" form with vowels and a "missing" form without them).
> I have a Hebrew text in the "full" form, and I want to strip the vowels to convert it to the "missing" form.
> 
> How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code points in decomposed form (NFD); then you use std.algorithm.filter on it:

	// original string
	auto str = "...";

	// create normalized decomposed string as a lazy range of dchar (NFD)
	auto decomposed = decompose(str);

	// filter to remove a combining code point (0x05B0, the Hebrew point sheva, here)
	auto filtered = filter!"a != 0x05B0"(decomposed);

	// turn it back into composed form (NFC), optional
	auto recomposed = compose(filtered);

	// convert back to a string (could also be wstring or dstring)
	string result = array(recomposed.by!char());

This last line is the one doing everything. All the rest just chains ranges together for on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.
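For anyone wanting to run this without the lazy decompose/compose, eager stand-ins built on a normalize function (assumed available, with NFD and NFC forms) give the same result at the cost of intermediate allocations:

	import std.algorithm : filter;
	import std.array : array;
	import std.conv : to;
	import std.uni : NFC, NFD, normalize;

	// eager version of the pipeline above
	string stripMark(string str, dchar mark)
	{
	    auto nfd = normalize!NFD(str);                 // decomposed copy
	    auto kept = nfd.filter!(c => c != mark).array; // drop the combining mark
	    return normalize!NFC(kept).to!string;          // recompose, back to UTF-8
	}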

A more naive implementation not taking advantage of code points but instead using a replacement table would also work:

	string str = "...";
	string result;
	string[string] replacements = ["é":"e"]; // change this for what you want
	foreach (grapheme; str) {
		auto replacement = grapheme in replacements;
		if (replacement)
			result ~= *replacement;
		else
			result ~= grapheme;
	}
	

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 15, 2011
Michel Fortin Wrote:

> On 2011-01-15 09:09:17 -0500, foobar <foo@bar.com> said:
> 
> > Lutger Blijdestijn Wrote:
> > 
> >> Michel Fortin wrote:
> >> 
> >>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn@gmail.com> said:
> >> ...
> >>>> 
> >>>> Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
> >>> 
> >>> Here's my idea.
> >>> 
> >>> I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.
> >>> 
> >> ...
> >> 
> >> Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!
> >> 
> > 
> > My two cents are against this kind of design. The "correct" approach IMO
> > is a 'universal text' type which is a _container_ of said text. This type
> > would provide ranges for the various abstraction levels, e.g.
> > text.codeUnits to iterate by code units.
> 
> Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar(), which would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you deal with whatever kind of code unit or code point you want. This would be mostly symmetric to what you can already do with foreach:
> 
> 	foreach (char c; "hello") {}
> 	foreach (wchar c; "hello") {}
> 	foreach (dchar c; "hello") {}
> // same as:
> 	foreach (c; "hello".by!char()) {}
> 	foreach (c; "hello".by!wchar()) {}
> 	foreach (c; "hello".by!dchar()) {}
> 
> 
> > Here's a (perhaps contrived) example:
> > Let's say I want to find the combining marks in some text.
> > 
> > For instance, Hebrew uses combining marks for vowels (among other
> > things), and they are optional in the language (there's a "full" form
> > with vowels and a "missing" form without them).
> > I have a Hebrew text in the "full" form, and I want to strip the vowels
> > to convert it to the "missing" form.
> > 
> > How would I accomplish this with your design?
> 
> All you need is a range that takes a string as input and gives you code points in decomposed form (NFD); then you use std.algorithm.filter on it:
> 
> 	// original string
> 	auto str = "...";
> 
> 	// create normalized decomposed string as a lazy range of dchar (NFD)
> 	auto decomposed = decompose(str);
> 
> 	// filter to remove a combining code point (0x05B0, the Hebrew point sheva, here)
> 	auto filtered = filter!"a != 0x05B0"(decomposed);
> 
> 	// turn it back into composed form (NFC), optional
> 	auto recomposed = compose(filtered);
> 
> 	// convert back to a string (could also be wstring or dstring)
> 	string result = array(recomposed.by!char());
> 
> This last line is the one doing everything. All the rest just chains ranges together for on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.
> 
> A more naive implementation not taking advantage of code points but instead using a replacement table would also work:
> 
> 	string str = "...";
> 	string result;
> 	string[string] replacements = ["é":"e"]; // change this for what you want
> 	foreach (grapheme; str) {
> 		auto replacement = grapheme in replacements;
> 		if (replacement)
> 			result ~= *replacement;
> 		else
> 			result ~= grapheme;
> 	}
> 

Ok, I guess I missed the "byDchar()" method.
I envisioned the same algorithm looking like this:

// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove a combining code point (0x05B0, the Hebrew point sheva, here)
auto filtered = filter!"a != 0x05B0"(decomposed);

// turn it back into composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);

The difference is that a string type is distinct from the intermediate code point ranges (this happens in your design too, albeit in a less obvious way to the user). There is string-specific code, so why not encapsulate it in a string type instead of forcing the user to use complex APIs with templates everywhere?

January 15, 2011
On Fri, 14 Jan 2011 15:54:19 -0500, Gerrit Wichert <gwichert@yahoo.com> wrote:

> Am 14.01.2011 15:34, schrieb Steven Schveighoffer:
>>
>> Is it common to have multiple modifiers on a single character?  The
>> problem I see with using decomposed canonical form for strings is that
>> we would have to return a dchar[] for each 'element', which severely
>> complicates code that, for instance, only expects to handle English.
>>
>> I was hoping to lazily transform a string into its composed canonical
>> form, allowing the (hopefully rare) exception when a composed
>> character does not exist.  My thinking was that this at least gives a
>> useful string representation for 90% of usages, leaving the remaining
>> 10% of usages to find a more complex representation (like your Text
>> type).  If we only get like 20% or 30% there by making dchar the
>> element type, then we haven't made it useful enough.
>>
> I'm afraid that this is not a proper way to handle this problem. It may
> be better for a language not to 'translate' by default.
> If the user wants to convert the code points, this can be requested on
> demand. But premature default conversion is a subtle way to lose
> information that may be important.
> Imagine we want to write a tool for dealing with the input/output of some
> other ignorant legacy software. Even if it is only text files, that
> software may choke on some converted input. So I believe that it is very
> important that we are able to reproduce strings in exactly the form in
> which we read them in.

Actually, this would only lazily *and temporarily* convert the string per grapheme.  Essentially, the original is left alone, so no harm there.
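To make that concrete (a small sketch; normalize is assumed to produce an NFC copy, standing in for the lazy per-grapheme conversion):

	import std.uni : normalize;

	string original = "expose\u0301";     // 'e' + combining acute, as read in
	auto composed = normalize(original);  // NFC copy, used only for comparison
	assert(composed == "exposé");         // matches the pre-combined spelling
	assert(original.length == 8);         // the original bytes are untouched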

-Steve.
January 15, 2011
On 2011-01-15 10:59:52 -0500, foobar <foo@bar.com> said:

> Ok, I guess I missed the "byDchar()" method.
> I envisioned the same algorithm looking like this:
> 
> // original string
> string str = "...";
> 
> // create normalized decomposed string as a lazy range of dchar (NFD)
> // Note: explicitly specify code points range:
> auto decomposed = decompose(str.codePoints);
> 
> // filter to remove a combining code point (0x05B0, the Hebrew point sheva, here)
> auto filtered = filter!"a != 0x05B0"(decomposed);
> 
> // turn it back into composed form (NFC), optional
> auto recomposed = compose(filtered);
> 
> // convert back to a string
> // Note: a string type can be constructed from a range of code points
> string result = string(recomposed);
> 
> The difference is that a string type is distinct from the intermediate code point ranges (this happens in your design too, albeit in a less obvious way to the user). There is string-specific code, so why not encapsulate it in a string type instead of forcing the user to use complex APIs with templates everywhere?

What I don't understand is in what way a string type would make the API less complex and use fewer templates.

More generally, in what way would your string type behave differently from char[], wchar[], and dchar[]? I think we need to clarify how you expect your string type to behave before I can answer anything. I mean, besides cosmetic changes such as having a codePoints property instead of by!dchar or byDchar, what is your string type doing differently?

The above algorithm is already possible with strings as they are, provided you implement the 'decompose' and 'compose' functions returning ranges. In fact, you only changed two things in it: by!dchar became codePoints, and array() became string(). Surely you're expecting more benefits than that.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/