January 14, 2011
On 14.01.2011 15:34, Steven Schveighoffer wrote:
>
> Is it common to have multiple modifiers on a single character?  The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.
>
> I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist.  My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type).  If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough.
>
I'm afraid that this is not a proper way to handle this problem. It may be better for a language not to 'translate' by default.
If the user wants to convert the code points, this can be requested on demand. But premature default conversion is a subtle way to lose information that may be important.
Imagine we want to write a tool for dealing with the input/output of some other ignorant legacy software. Even if it is only text files, that software may choke on some converted input. So I believe that it is very important that we are able to reproduce strings in exactly the form in which we read them in.
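
A minimal sketch of the kind of information loss I mean, assuming std.uni.normalize (a Phobos facility newer than what we have at the time of this thread): after composing to canonical form, the string no longer matches the bytes that were read in.

import std.uni;

void main()
{
    // "é" written as 'e' followed by a combining acute accent (U+0301),
    // exactly as it might arrive from some other program's output.
    string original = "e\u0301";

    // Composing to canonical (NFC) form replaces the pair with the
    // single precomposed code point U+00E9.
    string composed = normalize!NFC(original);

    // Visually the same character, but the code units differ, so
    // writing `composed` back out does not reproduce the input.
    assert(composed == "\u00E9");
    assert(composed != original);
}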

Gerrit
January 14, 2011
"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message news:mailman.631.1295038817.4748.digitalmars-d@puremagic.com...
>On 1/14/11, Nick Sabalausky <a@a.a> wrote:
>> import std.stdio;
>>
>> version(Windows)
>> {
>>     import std.c.windows.windows;
>>     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
>> }
>>
>> void main()
>> {
>>     version(Windows) SetConsoleOutputCP(65001);
>>
>>     writeln("HuG says: Fukken Über Death Terminal");
>> }
>>
>
>Does that work for you? I get back:
>HuG says: Fukken Ãœber Death Terminal
>

Yea, it works for me (XP Pro SP2 32-bit), and my "chcp" is 437, not 65001. The NG or copy-paste might have messed it up. Try with a code-point escape sequence:

import std.stdio;

version(Windows)
{
    import std.c.windows.windows;
    extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
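    // 65001 is the UTF-8 code page (the same one "chcp 65001" selects)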
    version(Windows) SetConsoleOutputCP(65001);

    writeln("HuG says: Fukken \u00DCber Death Terminal");
}




January 14, 2011
Michel Fortin <michel.fortin@michelf.com> wrote:

> 
> 	mythical 7 with an umlaut: 7̈
> 	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́
> 
> I can't guarantee your news reader will display the above correctly, but it works as described in mine (Unison on Mac OS X). In fact, it should work in all Cocoa-based applications. This probably includes iOS-based devices too, but I haven't tested there.
> 
All the examples given so far worked fine on my iPhone.

Gianluigi
January 14, 2011
On 1/14/11 7:50 AM, Michel Fortin wrote:
> On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/13/11 7:09 PM, Michel Fortin wrote:
>>> That's forgetting that most of the time people care about graphemes
>>> (user-perceived characters), not code points.
>>
>> I'm not so sure about that. What do you base this assessment on? Denis
>> wrote a library that according to him does grapheme-related stuff
>> nobody else does. So apparently graphemes is not what people care
>> about (although it might be what they should care about).
>
> Apple implemented all these things in the NSString class in Cocoa. They
> did all this work on Unicode at the beginning of Mac OS X, at a time
> where making such changes wouldn't break anything.
>
> It's a hard thing to change later when you have code that depends on the
> old behaviour. It's a complicated matter and not so many people will
> understand the issues, so it's no wonder many languages just deal with
> code points.

That's a strong indicator, but we shouldn't get ahead of ourselves.

D took a certain risk by defaulting to Unicode at a time where the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode.

I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this group who protest against the loss in efficiency and argue that life is harder for ASCII users.

Switching to variable-length representation of graphemes as bundles of dchars and committing to that through and through will bring with it a larger hit in efficiency and an increased difficulty in usage. I agree that at some level that's the "right" thing to do, but I don't yet have the feeling that combining characters are a widely-adopted winner. For the most part, fonts don't support combining characters, and as a font dilettante I can tell that putting arbitrary sets of diacritics on top of characters is not what one should be doing as it'll look terrible. Unicode is begrudgingly acknowledging combining characters. Only a handful of libraries deal with them. I don't know how many applications need or care for them, versus how many applications do fine with precombined characters. I have trouble getting combining characters to combine on this machine in any of the applications I use - and this is a Mac.
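
For concreteness, this is roughly what a bundle-of-code-points element looks like through std.uni.byGrapheme, assumed here from a Phobos newer than the one this thread is written against, using Michel's "7 with umlaut" as input:

import std.stdio;
import std.uni;

void main()
{
    // One user-perceived character built from two code points:
    // '7' (U+0037) followed by a combining diaeresis (U+0308).
    auto g = "7\u0308".byGrapheme.front;

    writeln(g.length); // number of code points bundled into this single grapheme: 2
    writeln(g[0]);     // '7'
    writeln(g[1]);     // the combining diaeresis on its own
}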


Andrei
January 14, 2011
On 1/14/11, Nick Sabalausky <a@a.a> wrote:
> Try with a code-point escape sequence

Nope, I still get the same results (tried with different fonts, Lucida etc., but I don't think it's a font issue). Maybe I have my settings messed up or something.
January 14, 2011
"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message news:mailman.633.1295044452.4748.digitalmars-d@puremagic.com...
> On 1/14/11, Nick Sabalausky <a@a.a> wrote:
>> Try with a code-point escape sequence
>
> Nope, I still get the same results (tried with different fonts, Lucida etc., but I don't think it's a font issue). Maybe I have my settings messed up or something.

Weird. Which version of Windows are you on, and are you using the regular command line, PowerShell, or something else? If you run "chcp 65001" from the cmd line first, does it work then?


January 14, 2011
Andrei Alexandrescu Wrote:

> That's a strong indicator, but we shouldn't get ahead of ourselves.
> 
> D took a certain risk by defaulting to Unicode at a time where the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode.
> 
> I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this group who protest against the loss in efficiency and argue that life is harder for ASCII users.
> 
> Switching to variable-length representation of graphemes as bundles of dchars and committing to that through and through will bring with it a larger hit in efficiency and an increased difficulty in usage. I agree that at some level that's the "right" thing to do, but I don't yet have the feeling that combining characters are a widely-adopted winner. For the most part, fonts don't support combining characters, and as a font dilettante I can tell that putting arbitrary sets of diacritics on top of characters is not what one should be doing as it'll look terrible. Unicode is begrudgingly acknowledging combining characters. Only a handful of libraries deal with them. I don't know how many applications need or care for them, versus how many applications do fine with precombined characters. I have trouble getting combining characters to combine on this machine in any of the applications I use - and this is a Mac.
> 
> 
> Andrei

Combining marks do need to be supported.
Some languages use combining marks extensively (see my other post), and of course fonts for those languages exist and they do support this. The Mac doesn't support all languages, so I'm unsure it's the best example out there.
Here's an example of the Hebrew Bible:
http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm

Just look at any of the PDFs there to see what Hebrew looks like with all sorts of different marks. In the same vein I could have found a Japanese text with ruby (where a kanji character has hiragana text above it that tells you how to read it).

Using a dchar as a string element instead of a proper grapheme will make it really hard to work with texts in such languages.
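
To make that concrete, here is a small sketch, assuming std.uni.byGrapheme from a Phobos newer than the current one, that measures a pointed Hebrew word at the three different levels; both the code-unit and the code-point counts overshoot the number of user-perceived characters:

import std.range : walkLength;
import std.stdio;
import std.uni;

void main()
{
    // A pointed Hebrew word: consonants interleaved with combining
    // vowel points (niqqud) and a shin dot.
    auto word = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD";

    writeln("code units (UTF-8 bytes): ", word.length);
    writeln("code points (dchars):     ", word.walkLength);
    writeln("graphemes (characters):   ", word.byGrapheme.walkLength);
}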

Regarding efficiency concerns for ASCII users - there's no rule that forces us to have a single string type; for comparison, just look at how many integral types D has. I believe that the correct thing is to have a 'universal string' type be the default (just like int is for integral types) and to provide additional types for other commonly useful encodings such as ASCII.

A geneticist for instance should use a 'DNA' type that encodes the four DNA letters instead of an ASCII string or even worse, a universal (Unicode) string.
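
A minimal sketch of what such a domain-specific type could look like; the names (Base, Dna) are made up here for illustration, and nothing of the sort exists in Phobos:

enum Base : ubyte { A, C, G, T }

struct Dna
{
    private Base[] bases;

    this(string s)
    {
        foreach (dchar c; s)
        {
            switch (c)
            {
                case 'A': bases ~= Base.A; break;
                case 'C': bases ~= Base.C; break;
                case 'G': bases ~= Base.G; break;
                case 'T': bases ~= Base.T; break;
                default: throw new Exception("not a DNA base");
            }
        }
    }

    size_t length() const { return bases.length; }
    Base opIndex(size_t i) const { return bases[i]; }
}

void main()
{
    auto strand = Dna("GATTACA");
    assert(strand.length == 7);
    assert(strand[0] == Base.G);
}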

January 14, 2011
On 1/14/11, Nick Sabalausky <a@a.a> wrote:
> Weird. Which version of Windows are you on, and are you using the regular command line, PowerShell, or something else? If you run "chcp 65001" from the cmd line first, does it work then?
>

Okay, it appears this is an issue with Console2. I'll have to report it to the dev, although he hasn't fixed much of anything in ages. I'm really contemplating writing my own shell by now. (No Linux jokes now, please. :p)

It works fine in cmd.exe with the Lucida font, without switching to code page 65001 manually. In fact, it works with the 437 code page as well when I comment out SetConsoleOutputCP.
January 14, 2011
On 1/15/11, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote:
> fact, it works with the 437 code page as well when I comment out SetConsoleOutputCP.
>

Whoops, let me revise what I said:

If the code has the call to change the code page, then I get the correct result in the console.
If it doesn't, I have to switch the code page manually.

I don't know what the problem with Console2 is, but if I change cmd.exe to always use a Lucida font then Console2 will output the correct result (even though I'm using fixedsys in Console2).

This is getting too specific and I don't want to hijack the thread. Everything is working fine now. Thx. :)
January 14, 2011
On 2011-01-14 17:04:08 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/14/11 7:50 AM, Michel Fortin wrote:
>> On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>> 
>>> On 1/13/11 7:09 PM, Michel Fortin wrote:
>>>> That's forgetting that most of the time people care about graphemes
>>>> (user-perceived characters), not code points.
>>> 
>>> I'm not so sure about that. What do you base this assessment on? Denis
>>> wrote a library that according to him does grapheme-related stuff
>>> nobody else does. So apparently graphemes is not what people care
>>> about (although it might be what they should care about).
>> 
>> Apple implemented all these things in the NSString class in Cocoa. They
>> did all this work on Unicode at the beginning of Mac OS X, at a time
>> where making such changes wouldn't break anything.
>> 
>> It's a hard thing to change later when you have code that depends on the
>> old behaviour. It's a complicated matter and not so many people will
>> understand the issues, so it's no wonder many languages just deal with
>> code points.
> 
> That's a strong indicator, but we shouldn't get ahead of ourselves.
> 
> D took a certain risk by defaulting to Unicode at a time where the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode.
> 
> I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this group who protest against the loss in efficiency and argue that life is harder for ASCII users.

Then perhaps it's time we figure out a way to handle non-Unicode encodings too. We can get away with treating ASCII strings as Unicode strings because of a useful property of UTF-8, but should we really do this?

Also, it'd really help this discussion to have some hard numbers about the cost of decoding graphemes.
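
Something along these lines would give a first approximation, sketched here with std.datetime.stopwatch.benchmark and std.uni.byGrapheme, both assumed from a Phobos newer than the one available today; the absolute numbers will of course depend on the machine and on the text:

import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio;
import std.uni;

void main()
{
    // Build a reasonably large test string containing combining marks.
    string text;
    foreach (i; 0 .. 100_000)
        text ~= "re\u0301sume\u0301 7\u0308 ";

    auto results = benchmark!(
        () => text.length,               // code units: no decoding at all
        () => text.walkLength,           // code points: UTF-8 decoding
        () => text.byGrapheme.walkLength // graphemes: decoding + segmentation
    )(10);

    writeln("code units:  ", results[0]);
    writeln("code points: ", results[1]);
    writeln("graphemes:   ", results[2]);
}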


> Switching to variable-length representation of graphemes as bundles of dchars and committing to that through and through will bring with it a larger hit in efficiency and an increased difficulty in usage. I agree that at some level that's the "right" thing to do, but I don't yet have the feeling that combining characters are a widely-adopted winner. For the most part, fonts don't support combining characters, and as a font dilettante I can tell that putting arbitrary sets of diacritics on top of characters is not what one should be doing as it'll look terrible. Unicode is begrudgingly acknowledging combining characters. Only a handful of libraries deal with them. I don't know how many applications need or care for them, versus how many applications do fine with precombined characters. I have trouble getting combining characters to combine on this machine in any of the applications I use - and this is a Mac.

I'm using the character palette: Edit menu > Special Characters... From there you can insert arbitrary code points. Use the search function of the palette to find code points with "combining" in their names, then click the big character box on the lower left to insert them. Have fun!


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/