May 29, 2007
Re: string types: const(char)[] and cstring
Regan Heath wrote:

> The default language/library support can reverse utf8 and 16 but it's not
> ideal, eg.  convert to utf32, reverse, convert back. ;)
> 
> Regan

I am not sure what you mean by this sentence...

The dstring implementation doesn't work the way you describe, so that's
definitely not the case here...


-- 
Regards
Marcin Kuszczak (Aarti_pl)
-------------------------------------
Ask me why I believe in Jesus - http://zapytaj.dlajezusa.pl (en/pl)
Doost (port of few Boost libraries) - http://www.dsource.org/projects/doost/
-------------------------------------
May 29, 2007
Re: string types: const(char)[] and cstring
Marcin Kuszczak Wrote:
> Regan Heath wrote:
> 
> > The default language/library support can reverse utf8 and 16 but it's not
> > ideal, eg.  convert to utf32, reverse, convert back. ;)
> > 
> > Regan
> 
> I am not sure what you mean by this sentence...
> 
> The dstring implementation doesn't work the way you describe, so that's
> definitely not the case here...

I'm lost, what is "dstring"?

All I meant was that using std.utf you can say:

char[] text = "<characters which take more than 1 char to represent>";

text = toUTF8(toUTF32(text).reverse);

and the result will be a correctly reversed UTF8 string.  Or am I missing something?

Regan Heath
May 29, 2007
Re: string types: const(char)[] and cstring
On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz>  
wrote:
> and the result will be a correctly reversed UTF8 string.  Or am I  
> missing something?
>
> Regan Heath

I think your method doesn't take compound characters into account.

For example:
// The accented é can be represented by a single code point, but let's
// assume it's a compound character (Ce`a).
writefln( toUTF8(toUTF32("Céa").reverse) ); // would reverse to a`eC
// This would print áeC
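
A minimal sketch of this effect, assuming a current D2 compiler and Phobos rather than the 2007 library (the `.reverse` array property used above has since given way to `std.algorithm.mutation.reverse`). The string is spelled with \u escapes so the combining accent survives the web interface:

```d
import std.algorithm.mutation : reverse;
import std.conv : to;
import std.stdio : writeln;
import std.utf : toUTF32;

void main()
{
    // 'C', 'e', U+0301 COMBINING ACUTE ACCENT, 'a'
    string text = "Ce\u0301a";

    // toUTF32 changes the encoding only; .dup gives a mutable copy to reverse.
    dchar[] codepoints = toUTF32(text).dup;
    assert(codepoints.length == 4);

    reverse(codepoints);

    // The accent now follows 'a' instead of 'e'.
    assert(codepoints == "a\u0301eC"d);
    writeln(codepoints.to!string);
}
```

Whether the output looks right depends on the terminal, but the asserts show that the accent has moved to a different base letter.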
May 29, 2007
Re: string types: const(char)[] and cstring
Aziz K. Wrote:
> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz>  
> wrote:
> > and the result will be a correctly reversed UTF8 string.  Or am I  
> > missing something?
> >
> > Regan Heath
> 
> I think your method doesn't take compound characters into account.
> 
> For example:
> // The accented é can be represented by a single code point, but let's
> // assume it's a compound character (Ce`a).

Is it a compound character in UTF32?

> writefln( toUTF8(toUTF32("Céa").reverse) ); // would reverse to a`eC
> // This would print áeC

Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.

My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.

Regan
May 29, 2007
Re: string types: const(char)[] and cstring
Reiner Pope wrote:
> Frits van Bommel wrote:
>> For other cases though, I could see how a "unique" (or similar) type 
>> constructor that would allow implicit conversion to both mutable and 
>> invariant (and const) types could be useful.
>> For instance, if the strings in your example were replaced by mutable 
>> arrays, a "unique char[]" return value of .dup could then be assigned 
>> to mutable/const/invariant references without needing casts.
> Funny, that's just what I thought of (including the name unique).

I'm pretty sure this has been suggested in these newsgroups in the past, 
including using "unique" as the keyword.

> When I 
>  first thought about it, I thought that such a construct would be very 
> useful and very powerful, but I can't actually think of any use cases 
> except with .dup and other constructor-type functions. (Although 
> supporting them should alone be enough motivation).

Some use cases I can think of:
* Obviously, builtin array property .dup, as you mentioned.
* std.utf.toUTF* (except the non-converting ones such as char[] -> char[])
* The result of certain operator overloads (arithmetic in a bignum 
class, opCat in a string class, the result of the builtin ~ operator for 
arrays)
* Lots of stuff in std.string: join, split, maketrans, all the toString 
overloads, format, succ, abbrev. (AFAIK all of these are guaranteed to 
return a unique array)
* toString overloads for classes that return the result of any of the 
above[1] (especially builtin ~ and std.string.format are often useful in 
toString, in my experience).

As you can see, there are plenty of cases where newly allocated objects 
or arrays are returned.


[1]: This one would require the ability to add "unique" in an overridden 
method, since it's a bad idea to require it of all classes. This could 
be considered to fall under the category of covariant return values.
May 29, 2007
Re: string types: const(char)[] and cstring
Regan Heath wrote:
> Aziz K. Wrote:
>> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz>  
>> wrote:
>>> and the result will be a correctly reversed UTF8 string.  Or am I  
>>> missing something?
>>>
>>> Regan Heath
>> I think your method doesn't take compound characters into account.
>>
>> For example:
>> // The accented é can be represented by a single code point, but let's
>> // assume it's a compound character (Ce`a).
> 
> Is it a compound character in UTF32?

Unicode defines multiple valid encodings for lots of accented 
characters; typically a single codepoint as well as separate codepoints 
for the accent and the "naked" character that combine when put together.

>> writefln( toUTF8(toUTF32("Céa").reverse) ); // would reverse to a`eC
>> // This would print áeC
> 
> Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.
> 
> My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.

I don't think std.utf.toUTF* combine or split accented characters, I'm 
pretty sure it just does codepoint representation conversions (keeping 
the number of codepoints constant).
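
A small sketch of that claim, assuming a current D2 Phobos (where `toUTF8`, `toUTF16` and `toUTF32` still live in `std.utf`). It only checks code point counts and does not depend on how a terminal renders the text:

```d
import std.utf : toUTF16, toUTF32, toUTF8;

void main()
{
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    string precomposed = "\u00e9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

    // Same number of code points before and after conversion.
    assert(toUTF32(decomposed).length == 2);
    assert(toUTF32(precomposed).length == 1);

    // Both code points fit in one UTF-16 unit each.
    assert(toUTF16(decomposed).length == 2);

    // Round-tripping gives back the original; nothing is combined or split.
    assert(toUTF8(toUTF32(decomposed)) == decomposed);
}
```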
May 29, 2007
Re: string types: const(char)[] and cstring
Regan Heath wrote:

> Marcin Kuszczak Wrote:
>> Regan Heath wrote:
>> 
>> > The default language/library support can reverse utf8 and 16 but it's
>> > not ideal, eg.  convert to utf32, reverse, convert back. ;)
>> > 
>> > Regan
>> 
>> I am not sure what you mean by this sentence...
>> 
>> The dstring implementation doesn't work the way you describe, so that's
>> definitely not the case here...
> 
> I'm lost, what is "dstring"?
> 
> All I meant was that using std.utf you can say:
> 
> char[] text = "<characters which take more than 1 char to represent>";
> 
> text = toUTF8(toUTF32(text).reverse);
> 
> and the result will be a correctly reversed UTF8 string.  Or am I missing
> something?
> 
> Regan Heath

dstring is an implementation of a string struct by Chris Miller which takes
care of slicing UTF-8 sequences and is compatible with char[], wchar[] and
dchar[]. I mentioned it because I think it's better when foreach knows
nothing about slicing UTF-8 sequences (the opposite of how it is implemented
currently). That should be the responsibility of a string class (e.g. dstring)
with a proper opApply method. Because my previous e-mail was in the context of
dstring, I didn't understand what you meant... 'reverse' and 'sort'
could also be implemented in such a class in a way that copes properly
with UTF-8 sequences...

http://www.digitalmars.com/d/archives/digitalmars/D/announce/New_string_implementation_dstring_1.0_4886.html
http://www.dprogramming.com/dstring.php
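
A rough sketch of what "a string class with a proper opApply" could look like, assuming a current D2 compiler. `Utf8Text` is a hypothetical name for illustration only and is not the dstring API; the decoding is delegated to `std.utf.decode`:

```d
import std.utf : decode;

// Hypothetical wrapper type, not the actual dstring API.
struct Utf8Text
{
    string data;

    // Lets foreach walk code points; the decoding lives in the type.
    int opApply(scope int delegate(dchar) dg)
    {
        size_t i = 0;
        while (i < data.length)
        {
            dchar c = decode(data, i); // decodes one code point and advances i
            if (auto result = dg(c))
                return result;
        }
        return 0;
    }
}

void main()
{
    auto text = Utf8Text("e\u0301 \u00e9");
    dchar[] seen;
    foreach (dchar c; text)
        seen ~= c;

    // Four code points: 'e', U+0301, ' ', U+00E9.
    assert(seen == "e\u0301 \u00e9"d);
    assert(seen.length == 4);
}
```

Note that this still iterates over code points, so a combining accent arrives as its own element.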


-- 
Regards
Marcin Kuszczak (Aarti_pl)
-------------------------------------
Ask me why I believe in Jesus - http://zapytaj.dlajezusa.pl (en/pl)
Doost (port of few Boost libraries) - http://www.dsource.org/projects/doost/
-------------------------------------
May 30, 2007
Re: string types: const(char)[] and cstring
Frits van Bommel Wrote:
> Regan Heath wrote:
> > Aziz K. Wrote:
> >> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan@netmail.co.nz>  
> >> wrote:
> >>> and the result will be a correctly reversed UTF8 string.  Or am I  
> >>> missing something?
> >>>
> >>> Regan Heath
> >> I think your method doesn't take compound characters into account.
> >>
> >> For example:
> >> // The accented é can be represented by a single code point, but let's
> >> // assume it's a compound character (Ce`a).
> > 
> > Is it a compound character in UTF32?
> 
> Unicode defines multiple valid encodings for lots of accented 
> characters; typically a single codepoint as well as separate codepoints 
> for the accent and the "naked" character that combine when put together.

I realise that.  But, the important question is what does toUTF32 do with compound UTF8 characters (or UTF16 for that matter)?  

> >> writefln( toUTF8(toUTF32("Céa").reverse) ); // would reverse to a`eC
> >> // This would print áeC
> > 
> > Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.
> > 
> > My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.
> 
> I don't think std.utf.toUTF* combine or split accented characters, I'm 
> pretty sure it just does codepoint representation conversions (keeping 
> the number of codepoints constant).

This is the key issue.  I was under the (perhaps mistaken) impression it converted them to the single codepoint version (as that was easier), which is what I based this idea on.  Really a simple test should tell us, can you whip one up to prove it one way or the other?  

I would, but I don't really use unicode at all and I don't know any compound characters offhand.  I know, I know, I could google it but I also get the impression you know a bit more about this and would be able to devise a better test case, or two.

Ahh.. another thought.  I think I may have based my assumption on the foreach behaviour, eg.

char[] text = "<compound stuff>";
foreach(dchar d; text) { .. }

this _has_ to give the single codepoint versions, right?

I suspect foreach uses the same code as in std.utf, but I may be wrong.

Regan Heath
May 30, 2007
Re: string types: const(char)[] and cstring
Marcin Kuszczak Wrote:
> Regan Heath wrote:
> 
> > Marcin Kuszczak Wrote:
> >> Regan Heath wrote:
> >> 
> >> > The default language/library support can reverse utf8 and 16 but it's
> >> > not ideal, eg.  convert to utf32, reverse, convert back. ;)
> >> > 
> >> > Regan
> >> 
> >> I am not sure what you mean by this sentence...
> >> 
> >> The dstring implementation doesn't work the way you describe, so that's
> >> definitely not the case here...
> > 
> > I'm lost, what is "dstring"?
> > 
> > All I meant was that using std.utf you can say:
> > 
> > char[] text = "<characters which take more than 1 char to represent>";
> > 
> > text = toUTF8(toUTF32(text).reverse);
> > 
> > and the result will be a correctly reversed UTF8 string.  Or am I missing
> > something?
> > 
> > Regan Heath
> 
> dstring is an implementation of a string struct by Chris Miller which takes
> care of slicing UTF-8 sequences and is compatible with char[], wchar[] and
> dchar[]. I mentioned it because I think it's better when foreach knows
> nothing about slicing UTF-8 sequences (the opposite of how it is implemented
> currently). That should be the responsibility of a string class (e.g. dstring)
> with a proper opApply method. Because my previous e-mail was in the context of
> dstring, I didn't understand what you meant... 'reverse' and 'sort'
> could also be implemented in such a class in a way that copes properly
> with UTF-8 sequences...

Ahh, thanks, that clears up the confusion I had.  Yes, a string class/struct could definitely handle the codepoint issue.  It would also be able to handle it better than the method I suggested, which is a brute force method based on an assumption which may prove to be false (I suspect toUTF32 converts UTF8 and 16 to non-compound UTF32 in all cases, but I could be wrong).
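
For comparison, a sketch of a reversal that keeps an accent with its base letter. It assumes a current D2 Phobos, whose `std.uni` module offers `byGrapheme` and `byCodePoint`; nothing like this was in the 2007 library, so treat it as an illustration of the idea rather than what was available at the time:

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    // 'C', 'e', U+0301 COMBINING ACUTE ACCENT, 'a'
    string original = "Ce\u0301a";

    // Group code points into grapheme clusters, reverse the clusters,
    // then flatten back to code points and encode as UTF-8.
    string reversed = original.byGrapheme.array.retro.byCodePoint.text;

    // The accent stays attached to 'e'.
    assert(reversed == "ae\u0301C");
}
```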

But to respond to your original point (which I didn't address earlier, sorry) I have no problem with the foreach behaviour:

char[] text = "<compound characters>";
foreach(dchar c; text) { .. }

because, I suspect, the code which handles this is in std.utf (toUTF32) already.  You seem to want to move the behaviour to a string class, but why can't it exist in both places?

I guess the problem you might have with it is that it effectively says to someone implementing a D compiler:  You need to handle conversions from/to UTF8, 16 and 32 and (assuming I am correct about toUTF32) you need to convert UTF8 and 16 to non-compound UTF32.

Which might make it harder for someone to implement a D compiler.  I don't know.

Regan Heath
May 30, 2007
Re: string types: const(char)[] and cstring
Regan Heath wrote:
> Frits van Bommel Wrote:
>> Regan Heath wrote:
>>> Aziz K. Wrote:
>>>> writefln( toUTF8(toUTF32("Céa").reverse) ); // would reverse to a`eC
>>>> // This would print áeC
>>> Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.
>>>
>>> My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.
>> I don't think std.utf.toUTF* combine or split accented characters, I'm 
>> pretty sure it just does codepoint representation conversions (keeping 
>> the number of codepoints constant).
> 
> This is the key issue.  I was under the (perhaps mistaken) impression it converted them to the single codepoint version (as that was easier), which is what I based this idea on.  Really a simple test should tell us, can you whip one up to prove it one way or the other?  

---
import std.stdio;
import std.utf;

void main(char[][] args) {
    // Codepoint 0301 is "Combining acute accent".
    // Codepoint 00e9 is "Latin small letter e with acute"
    char[] str = "e\u0301 \u00e9";

    // This doesn't show the combined character on my console.
    // Perhaps my terminal doesn't properly support combining characters.
    // (My encoding is utf-8, so that shouldn't be the problem)
    // The precomposed character (00e9) is displayed properly.
    // When piped to a .html file and wrapped with
    // <html><body>...</body></html> firefox properly displays both.
    writefln(str);
    foreach (dchar c; str) {
        writef("%04x ", c);
    }
    writefln();

    // This produces the exact same output as above code:
    dchar[] dstr = toUTF32(str);
    writefln(dstr);
    foreach (dchar c; dstr) {
        writef("%04x ", c);
    }
    writefln();
}
---

> I would, but I don't really use unicode at all and I don't know any compound characters offhand.  I know, I know, I could google it but I also get the impression you know a bit more about this and would be able to devise a better test case, or two.

I normally have little use for it as well. A few Dutch (my native 
tongue) words need accents, but I'll be damned if I know the codes. Let 
alone those of any combining characters. My usual way of typing those is 
either using the symbol map or just typing it without accents, 
right-click, select spell-check suggestion with accents :).
However, for above test I just looked up the codes in the code charts on 
the unicode website (unicode.org/charts for the precomposed character 
and the "symbols and punctuation" link at the top for the combining 
accent). It's pretty easy to find, actually.

> Ahh.. another thought.  I think I may have based my assumption on the foreach behaviour, eg.
> 
> char[] text = "<compound stuff>";
> foreach(dchar d; text) { .. }
> 
> this _has_ to give the single codepoint versions, right?

As demonstrated above, it doesn't. The runtime support for the 
converting foreach statements just imports std.utf and uses decode and 
toUTF*[1] (as well as some manual conversion to surrogates in the 
functions dealing with wchar). None of those do anything other than 
decoding and encoding single codepoints.


[1]: The apparently undocumented (buf, dchar) overloads, which don't 
allocate.

> I suspect foreach uses the same code as in std.utf, but I may be wrong.

About this, you're not :P.


I suspect the reason std.utf doesn't do decomposition and/or combining 
is that it would require a lookup table, and possibly quite a big one at 
that. Though generating it shouldn't be a problem; it could be trivially 
extracted from the machine-readable data on the unicode website. Just 
take http://www.unicode.org/Public/UNIDATA/UnicodeData.txt, the sixth 
column is the decomposition of the character in the first column. (It 
may also contain the mapping type between <angle brackets>)
Note that for full decomposition this mapping needs to be applied 
recursively[2], i.e. the characters in the 6th column need to be 
decomposed as well (if possible).

[2]: See the reminder in 
http://www.unicode.org/Public/UNIDATA/UCD.html#Character_Decomposition_Mappings
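
As an illustration of what such a table buys you, here is a sketch assuming a current D2 Phobos, where `std.uni.normalize` performs canonical decomposition (NFD) and composition (NFC). No such function existed in the library discussed in this thread:

```d
import std.uni : NFC, NFD, normalize;

void main()
{
    string precomposed = "\u00e9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    // The two spellings differ at the code point level...
    assert(precomposed != decomposed);

    // ...but normalization maps one onto the other.
    assert(normalize!NFD(precomposed) == decomposed);
    assert(normalize!NFC(decomposed) == precomposed);
}
```

Normalizing to NFC before reversing code points would fix the é example, but only for sequences that have a precomposed form, which not every base-plus-accent combination does.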