May 29, 2007
Regan Heath wrote:

> The default language/library support can reverse utf8 and 16 but it's not ideal, eg.  convert to utf32, reverse, convert back. ;)
> 
> Regan

I am not sure what you mean by this sentence...

The dstring implementation doesn't work the way you describe, so that's definitely not the case here...


-- 
Regards
Marcin Kuszczak (Aarti_pl)
-------------------------------------
Ask me why I believe in Jesus - http://zapytaj.dlajezusa.pl (en/pl)
Doost (port of few Boost libraries) - http://www.dsource.org/projects/doost/
-------------------------------------

May 29, 2007
Marcin Kuszczak Wrote:
> Regan Heath wrote:
> 
> > The default language/library support can reverse utf8 and 16 but it's not ideal, eg.  convert to utf32, reverse, convert back. ;)
> > 
> > Regan
> 
> I am not sure what do you mean with this sentence...
> 
> dstring implementation doesn't do things according to your description, so it's definitely not a case here...

I'm lost, what is "dstring"?

All I meant was that using std.utf you can say:

char[] text = "<characters which take more than 1 char to represent>";

text = toUTF8(toUTF32(text).reverse);

and the result will be a correctly reversed UTF8 string.  Or am I missing something?

Regan Heath
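
The round trip described above can be sketched for a current D2 compiler, where string literals are immutable and the built-in .reverse array property is no longer available. This is only an illustration under those assumptions; reverseCodePoints is a made-up helper name, not something from std.utf.

```d
import std.algorithm.mutation : reverse;
import std.utf : toUTF8, toUTF32;

/// Reverses a UTF-8 string one code point at a time (hypothetical helper).
string reverseCodePoints(string text)
{
    dchar[] buf = toUTF32(text).dup; // mutable copy of the UTF-32 form
    reverse(buf);                    // in-place reversal of the code points
    return toUTF8(buf);              // re-encode as UTF-8
}

void main()
{
    import std.stdio : writeln;

    // U+00E9 is the precomposed e-acute: it needs two UTF-8 code units
    // but is a single code point, so it survives the reversal intact.
    assert(reverseCodePoints("h\u00e9llo") == "oll\u00e9h");
    writeln(reverseCodePoints("h\u00e9llo"));
}
```

This handles multi-byte sequences correctly, but it works on code points only; whether that is enough for accented text depends on how the accents are represented.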
May 29, 2007
On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz> wrote:
> and the result will be a correctly reversed UTF8 string.  Or am I missing something?
>
> Regan Heath

I think your method doesn't take compound characters into account.

For example:
// The accented é can be represented by a single code point, but let's assume it's a compound character (Ce`a).
writefln( toUTF8(toUTF32("Céa").reverse) ); // would reverse to a`eC
// This would print áeC
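
The same example can be written with \u escapes so that nothing gets mangled in transit. A minimal sketch, assuming a current D2 compiler; reversedCodePoints is a made-up helper, and U+0301 is the combining acute accent:

```d
import std.algorithm.mutation : reverse;

/// Returns a copy of `s` with its code points in reverse order
/// (hypothetical helper, works directly on UTF-32).
dstring reversedCodePoints(dstring s)
{
    dchar[] buf = s.dup;
    reverse(buf);
    return buf.idup;
}

void main()
{
    import std.stdio : writeln;

    // "C", "e", U+0301, "a": the accent belongs to the "e".
    dstring decomposed = "Ce\u0301a"d;
    dstring reversed = reversedCodePoints(decomposed);

    // Code point order is now "a", U+0301, "e", "C",
    // so the accent follows (and attaches to) the "a".
    assert(reversed == "a\u0301eC"d);
    // A reversal that kept the accent on the "e" would have given this:
    assert(reversed != "ae\u0301C"d);
    writeln(reversed);
}
```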
May 29, 2007
Aziz K. Wrote:
> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz> wrote:
> > and the result will be a correctly reversed UTF8 string.  Or am I missing something?
> >
> > Regan Heath
> 
> I think your method doesn't take compound characters into account.
> 
> For example:
> // The accented é can be represented by a single code-point. But let's
> assume it's a compound character (Ce`a).

Is it a compound character in UTF32?

> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
> // This would print áeC

Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.

My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.

Regan
May 29, 2007
Reiner Pope wrote:
> Frits van Bommel wrote:
>> For other cases though, I could see how a "unique" (or similar) type constructor that would allow implicit conversion to both mutable and invariant (and const) types could be useful.
>> For instance, if the strings in your example were replaced by mutable arrays, a "unique char[]" return value of .dup could then be assigned to mutable/const/invariant references without needing casts.
> Funny, that's just what I thought of (including the name unique).

I'm pretty sure this has been suggested in these newsgroups in the past, including using "unique" as the keyword.

> When I  first thought about it, I thought that such a construct would be very useful and very powerful, but I can't actually think of any use cases except with .dup and other constructor-type functions. (Although supporting them should alone be enough motivation).

Some use cases I can think of:
* Obviously, builtin array property .dup, as you mentioned.
* std.utf.toUTF* (except the non-converting ones such as char[] -> char[])
* The result of certain operator overloads (arithmetic in a bignum class, opCat in a string class, the result of the builtin ~ operator for arrays)
* Lots of stuff in std.string: join, split, maketrans, all the toString overloads, format, succ, abbrev. (AFAIK all of these are guaranteed to return a unique array)
* toString overloads for classes that return the result of any of the above[1] (especially builtin ~ and std.string.format are often useful in toString, in my experience).

As you can see, there are plenty of cases where newly allocated objects or arrays are returned.


[1]: This one would require the ability to add "unique" in an overridden method, since it's a bad idea to require it of all classes. This could be considered to fall under the category of covariant return values.
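
For comparison, a small sketch of how a current D2 compiler covers the .dup-style case without a "unique" keyword: the result of a pure function whose arguments cannot alias the returned array converts implicitly to mutable, const or immutable. This assumes D2 (where "invariant" is spelled immutable); repeatChar is a made-up helper used only for illustration.

```d
/// Builds a fresh array; nothing else can hold a reference to it.
char[] repeatChar(char c, size_t n) pure
{
    auto buf = new char[n];
    buf[] = c; // fill every element with c
    return buf;
}

void main()
{
    char[] mutableCopy     = repeatChar('x', 3);
    string immutableCopy   = repeatChar('y', 3); // implicit conversion, no cast
    const(char)[] constRef = repeatChar('z', 3);

    mutableCopy[0] = 'X'; // the mutable reference stays writable
    assert(mutableCopy == "Xxx");
    assert(immutableCopy == "yyy");
    assert(constRef == "zzz");
}
```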
May 29, 2007
Regan Heath wrote:
> Aziz K. Wrote:
>> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz>  wrote:
>>> and the result will be a correctly reversed UTF8 string.  Or am I  missing something?
>>>
>>> Regan Heath
>> I think your method doesn't take compound characters into account.
>>
>> For example:
>> // The accented é can be represented by a single code-point. But let's  assume it's a compound character (Ce`a).
> 
> Is it a compound character in UTF32?

Unicode allows multiple valid representations of many accented characters: typically a single precomposed codepoint, as well as a sequence of separate codepoints for the "naked" character and the accent, which combine when put together.

>> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
>> // This would print áeC
> 
> Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it) I'd like to play with it.
> 
> My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversable.

I don't think std.utf.toUTF* combine or split accented characters; I'm pretty sure they just convert between codepoint representations (keeping the number of codepoints constant).
May 29, 2007
Regan Heath wrote:

> Marcin Kuszczak Wrote:
>> Regan Heath wrote:
>> 
>> > The default language/library support can reverse utf8 and 16 but it's not ideal, eg.  convert to utf32, reverse, convert back. ;)
>> > 
>> > Regan
>> 
>> I am not sure what do you mean with this sentence...
>> 
>> dstring implementation doesn't do things according to your description, so it's definitely not a case here...
> 
> I'm lost, what is "dstring"?
> 
> All I meant was that using std.utf you can say:
> 
> char[] text = "<characters which take more than 1 char to represent>";
> 
> text = toUTF8(toUTF32(text).reverse);
> 
> and the result will be a correctly reversed UTF8 string.  Or am I missing something?
> 
> Regan Heath

dstring is a string struct implementation by Chris Miller that takes care of slicing UTF-8 sequences and is compatible with char[], wchar[] and dchar[]. I mentioned it because I think it's better when foreach knows nothing about slicing UTF-8 sequences (the opposite of how it is currently implemented). That should be the responsibility of a string class (such as dstring) with a proper opApply method. Because my previous e-mail was written in the context of dstring, I didn't understand what you meant... 'reverse' and 'sort' could also be implemented in such a class in a way that copes properly with UTF-8 sequences...

http://www.digitalmars.com/d/archives/digitalmars/D/announce/New_string_implementation_dstring_1.0_4886.html
http://www.dprogramming.com/dstring.php
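
To make the opApply idea concrete, here is a rough sketch of a struct that owns UTF-8 data and does the decoding itself. It is not the real dstring API: Utf8Text and its layout are invented for this example, and the code assumes a current D2 compiler with std.utf.decode.

```d
import std.utf : decode;

/// Hypothetical wrapper (not dstring): stores UTF-8 and hands out whole
/// code points, so callers never see a partial sequence.
struct Utf8Text
{
    string data;

    /// Called by `foreach (dchar c; text)`.
    int opApply(scope int delegate(dchar) dg)
    {
        size_t i = 0;
        while (i < data.length)
        {
            dchar c = decode(data, i); // advances i past the whole sequence
            if (auto result = dg(c))
                return result;         // loop body asked to stop (break/return)
        }
        return 0;
    }
}

void main()
{
    import std.stdio : writefln;

    auto text = Utf8Text("e\u0301 \u00e9");
    foreach (dchar c; text)
        writefln("%04x", cast(uint) c);
}
```

With the decoding inside the type, methods such as a reverse or sort could be added next to opApply without the language itself having to know about UTF-8.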


-- 
Regards
Marcin Kuszczak (Aarti_pl)

May 30, 2007
Frits van Bommel Wrote:
> Regan Heath wrote:
> > Aziz K. Wrote:
> >> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz> wrote:
> >>> and the result will be a correctly reversed UTF8 string.  Or am I missing something?
> >>>
> >>> Regan Heath
> >> I think your method doesn't take compound characters into account.
> >>
> >> For example:
> >> // The accented é can be represented by a single code-point. But let's
> >> assume it's a compound character (Ce`a).
> > 
> > Is it a compound character in UTF32?
> 
> Unicode defines multiple valid encodings for lots of accented characters; typically a single codepoint as well as separate codepoints for the accent and the "naked" character that combine when put together.

I realise that.  But, the important question is what does toUTF32 do with compound UTF8 characters (or UTF16 for that matter)?

> >> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
> >> // This would print áeC
> > 
> > Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it) I'd like to play with it.
> > 
> > My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversable.
> 
> I don't think std.utf.toUTF* combine or split accented characters, I'm pretty sure it just does codepoint representation conversions (keeping the number of codepoints constant).

This is the key issue.  I was under the (perhaps mistaken) impression it converted them to the single codepoint version (as that was easier), which is what I based this idea on.  Really a simple test should tell us, can you whip one up to prove it one way or the other?

I would, but I don't really use unicode at all and I don't know any compound characters offhand.  I know, I know, I could google it but I also get the impression you know a bit more about this and would be able to devise a better test case, or two.

Ahh.. another thought.  I think I may have based my assumption on the foreach behaviour, eg.

char[] text = "<compound stuff>";
foreach(dchar d; text) { .. }

this _has_ to give the single codepoint versions, right?

I suspect foreach uses the same code as in std.utf, but I may be wrong.

Regan Heath
May 30, 2007
Marcin Kuszczak Wrote:
> Regan Heath wrote:
> 
> > Marcin Kuszczak Wrote:
> >> Regan Heath wrote:
> >> 
> >> > The default language/library support can reverse utf8 and 16 but it's not ideal, eg.  convert to utf32, reverse, convert back. ;)
> >> > 
> >> > Regan
> >> 
> >> I am not sure what do you mean with this sentence...
> >> 
> >> dstring implementation doesn't do things according to your description, so it's definitely not a case here...
> > 
> > I'm lost, what is "dstring"?
> > 
> > All I meant was that using std.utf you can say:
> > 
> > char[] text = "<characters which take more than 1 char to represent>";
> > 
> > text = toUTF8(toUTF32(text).reverse);
> > 
> > and the result will be a correctly reversed UTF8 string.  Or am I missing something?
> > 
> > Regan Heath
> 
> dstring is implementation of string struct by Chris Miller which takes care about slicing utf8 sequences and is compatible with char[], wchar[] and dchar[]. I mentioned it because I think that it's better when foreach know nothing about slicing utf8 sequence (opposite to way it is implemented currently). It should be responsibility of string class (like e.g. dstring) with proper opApply method. Because my previous e-mail was in context of dstring, I haven't understood what did you mean... 'reverse' and 'sort' could be also implemented in such class in a way which will cope properly with utf8 sequences...

Ahh, thanks, that clears up the confusion I had.  Yes, a string class/struct could definitely handle the codepoint issue.  It would also be able to handle it better than the method I suggested, which is a brute force method based on an assumption which may prove to be false (I suspect toUTF32 converts UTF8 and 16 to non-compound UTF32 in all cases, but I could be wrong).

But to respond to your original point (which I didn't address earlier, sorry) I have no problem with the foreach behaviour:

char[] text = "<compound characters>";
foreach(dchar c; text) { .. }

because, I suspect, the code which handles this is in std.utf (toUTF32) already.  You seem to want to move the behaviour to a string class, but why can't it exist in both places?

I guess the problem you might have with it is that it effectively says to someone implementing a D compiler:  You need to handle conversions from/to UTF8, 16 and 32 and (assuming I am correct about toUTF32) you need to convert UTF8 and 16 to non-compound UTF32.

Which might make it harder for someone to implement a D compiler.  I don't know.

Regan Heath
May 30, 2007
Regan Heath wrote:
> Frits van Bommel Wrote:
>> Regan Heath wrote:
>>> Aziz K. Wrote:
>>>> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
>>>> // This would print áeC
>>> Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it) I'd like to play with it.
>>>
>>> My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversable.
>> I don't think std.utf.toUTF* combine or split accented characters, I'm pretty sure it just does codepoint representation conversions (keeping the number of codepoints constant).
> 
> This is the key issue.  I was under the (perhaps mistaken) impression it converted them to the single codepoint version (as that was easier), which is what I based this idea on.  Really a simple test should tell us, can you whip one up to prove it one way or the other?  

---
import std.stdio;
import std.utf;

void main(char[][] args) {
    // Codepoint 0301 is "Combining acute accent".
    // Codepoint 00e9 is "Latin small letter e with acute"
    char[] str = "e\u0301 \u00e9";

    // This doesn't show the combined character on my console.
    // Perhaps my terminal doesn't properly support combining characters.
    // (My encoding is utf-8, so that shouldn't be the problem)
    // The precomposed character (00e9) is displayed properly.
    // When piped to a .html file and wrapped with
    // <html><body>...</body></html> firefox properly displays both.
    writefln(str);
    foreach (dchar c; str) {
        writef("%04x ", c);
    }
    writefln();

    // This produces the exact same output as above code:
    dchar[] dstr = toUTF32(str);
    writefln(dstr);
    foreach (dchar c; dstr) {
        writef("%04x ", c);
    }
    writefln();
}
---

> I would, but I don't really use unicode at all and I don't know any compound characters offhand.  I know, I know, I could google it but I also get the impression you know a bit more about this and would be able to devise a better test case, or two.

I normally have little use for it as well. A few Dutch (my native tongue) words need accents, but I'll be damned if I know the codes. Let alone those of any combining characters. My usual way of typing those is either using the symbol map or just typing it without accents, right-click, select spell-check suggestion with accents :).
However, for the above test I just looked up the codes in the code charts on the unicode website (unicode.org/charts for the precomposed character and the "symbols and punctuation" link at the top for the combining accent). It's pretty easy to find, actually.

> Ahh.. another thought.  I think I may have based my assumption on the foreach behaviour, eg.
> 
> char[] text = "<compund stuff>";
> foreach(dchar d; text) { .. }
> 
> this _has_ to give the single codepoint versions, right?

As demonstrated above, it doesn't. The runtime support for the converting foreach statements just imports std.utf and uses decode and toUTF*[1] (as well as some manual conversion to surrogates in the functions dealing with wchar). None of those do anything other than decode and encode single codepoints.


[1]: The apparently undocumented (buf, dchar) overloads, which don't allocate.

> I suspect foreach uses the same code as in std.utf, but I may be wrong.

About this, you're not :P.
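
Since decoding alone only ever yields individual codepoints, a reversal that keeps accents attached has to group them first. A minimal sketch, assuming a current D2 compiler whose Phobos ships std.uni with byGrapheme, byCodePoint and normalize (this is a different technique from the std.utf round trip discussed above, and reverseGraphemes is a made-up name):

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme, normalize, NFC;

/// Reverses `s` grapheme by grapheme, so a combining mark stays on its base.
string reverseGraphemes(string s)
{
    return s.byGrapheme   // group base characters with their combining marks
            .array
            .retro        // walk the groups back to front
            .byCodePoint  // flatten back into code points
            .text;        // and encode as a UTF-8 string
}

void main()
{
    import std.stdio : writeln;

    string decomposed = "Ce\u0301a"; // "C", "e", U+0301, "a"

    // The accent stays with the "e" after the reversal.
    assert(reverseGraphemes(decomposed) == "ae\u0301C");

    // Canonical composition (NFC) folds "e" + U+0301 into U+00E9.
    assert(normalize!NFC(decomposed) == "C\u00e9a");

    writeln(reverseGraphemes(decomposed));
}
```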


I suspect the reason std.utf doesn't do decomposition and/or combining is that it would require a lookup table, and possibly quite a big one at that. Though generating it shouldn't be a problem; it could be trivially extracted from the machine-readable data on the unicode website. Just take http://www.unicode.org/Public/UNIDATA/UnicodeData.txt, the sixth column is the decomposition of the character in the first column. (It may also contain the mapping type between <angle brackets>)
Note that for full decomposition this mapping needs to be applied recursively[2], i.e. the characters in the 6th column need to be decomposed as well (if possible).

[2]: See the reminder in http://www.unicode.org/Public/UNIDATA/UCD.html#Character_Decomposition_Mappings
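
The table extraction and the recursive step can be sketched as follows. The code assumes a current D2 compiler; buildTable and decompose are invented names, the four records are a small sample in the UnicodeData.txt layout rather than the full file, and compatibility mappings (those starting with a <tag>) are simply skipped.

```d
import std.array : split;
import std.conv : to;
import std.string : splitLines;

// A few sample records; field 1 is the code point, field 6 the decomposition.
enum sampleData =
`00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
00EA;LATIN SMALL LETTER E WITH CIRCUMFLEX;Ll;0;L;0065 0302;;;;N;LATIN SMALL LETTER E CIRCUMFLEX;;00CA;;00CA
1EBF;LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00EA 0301;;;;N;;;1EBE;;1EBE`;

/// Builds a code point -> canonical decomposition table from the records.
dchar[][dchar] buildTable(string data)
{
    dchar[][dchar] table;
    foreach (line; data.splitLines)
    {
        auto fields = line.split(";");
        if (fields.length < 6 || fields[5].length == 0)
            continue; // this character has no decomposition
        if (fields[5][0] == '<')
            continue; // tagged (compatibility) mapping: ignored in this sketch
        dchar[] parts;
        foreach (hex; fields[5].split(" "))
            parts ~= cast(dchar) hex.to!uint(16);
        table[cast(dchar) fields[0].to!uint(16)] = parts;
    }
    return table;
}

/// Applies the table recursively until nothing decomposes any further.
dchar[] decompose(dchar c, dchar[][dchar] table)
{
    auto entry = c in table;
    if (entry is null)
        return [c];
    dchar[] result;
    foreach (part; *entry)
        result ~= decompose(part, table);
    return result;
}

void main()
{
    import std.stdio : writefln;

    auto table = buildTable(sampleData);
    // U+1EBF maps to U+00EA U+0301, and U+00EA maps again to U+0065 U+0302.
    foreach (c; decompose('\u1EBF', table))
        writefln("%04X", cast(uint) c);
}
```

A real implementation would also have to put combining marks into canonical order after decomposing, which this sketch does not attempt.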