November 15, 2005

Oskar Linde wrote:
> In article <43790E73.1080704@nospam.org>, Georg Wrede says...
> 
>>>I think we've painted ourselves into a corner by calling the UTF-8 entity "char"!!
>>
>>Given:
>>
>>    char[10] foo;
>>
>>What is the storage capacity of foo?
>>
>> - is it 10 UTF-8 characters
>> - is it 2.5 UTF-8 characters
>> - "it depends"
>> - something else
> 
> 
> It is obviously 10 UTF-8 code units. 10 bytes of memory get reserved.
> Nothing else makes sense. There is no such thing as a UTF-8
> character. There are Unicode characters, each of which is encoded
> as 1-4 UTF-8 code units (bytes).

According to D docs there is!

Which is yet another reason to figure out how to redo all of this UTF-related stuff in D.

But I do understand why this is such a mess. Had Walter dug to the bottom of this, the rest of D would be a lot less far along than it is today. And since the current way of doing things has worked for most purposes, this hasn't exactly been a top priority.

Which I'm happy about, since we are now doing well with the template system, which is (currently) a lot more important.

> 
>>Another question:
>>
>>How much storage capacity should I allocate here:
>>
>>   char[?] bar = "Äiti syö lettuja.";
>>      // Finnish for "Mother eats pancakes." :-)
>>   char[?] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
>>      // I hope al Qaida or the CIA don't knock on my door,
>>      // I hereby officially state I have no idea what I wrote.
> 
> 
> If you don't want the extra overhead of a dynamic char[], auto will help you here:
> 
> auto baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
> 
> writef("%s\n",typeid(typeof(baz))); // will print char[33]
> 
> This string will always be a char[33], no matter if the file encoding is
> UTF-8, UTF-16 or UTF-32.

That's true. Irrespective of whether the Arabic string is written as a "bla"c, "bla"w or "bla"d, once it ultimately gets stored in a char[] it'll be that same 33.

If one stores it in a wchar[] it's something else. (Not bothering to find out, I'd guess 11. That's because I originally constructed it on a Windows machine, which people say does 16-bit UTF, and knowing Bill's way with software, I'd say that any entity needing more than those 16 bits would be inaccessible!)

And in a dchar[] it would definitely be 11.
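
Here's a little sketch one could compile to check (assuming, as above, that the string really is 11 presentation-form characters, each a single BMP code point taking 3 bytes in UTF-8):

import std.stdio;

void main()
{
    char[]  c = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
    wchar[] w = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
    dchar[] d = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";

    writefln(c.length);  // 33 -- UTF-8 code units, 3 bytes per character here
    writefln(w.length);  // 11 -- UTF-16 code units, all characters are in the BMP
    writefln(d.length);  // 11 -- UTF-32 code units, one per character
}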

>>(Heh, btw, upon writing that string, my cursor started looking weird. Now I'll just have to see whether my news reader or windows itself crashes first!! Which should never happen because I got the characters "legally", i.e. from the windows Character Map in the System Tools menu.)
> 
> Unicode support is hardly mature everywhere yet... :)

You can say that again! All the replies to the original Arabic string post had the string shown correctly, except yours!   ;-)

What machine/OS/program did you write this with?

Yours didn't even handle the Finnish string!
November 15, 2005
Georg Wrede wrote:


> Here's one you'd love: Guess what the following prints:
> 
>     char[] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
>     auto len = baz.length;
>     writefln(len);
> 
> Another one: did you know that it is impossible on some machines to
> select that line from the beginning up until the middle of the Arabic
> string?

From what Thunderbird does, I'd say that's because it's actually written right-to-left :)


> Likewise, on some others, copying and pasting that string results in it
> getting backward.

Ditto..

>> I believe there is a rule that the shortest possible one is deemed 'correct' for purposes like ours shown above, but if you obtain the strings from a file, say, you may encounter the same 'string' represented in UTF-8 but find that the char[]s are different lengths.
> 
> 
> Aarrghh, and how am I supposed to figure out which of a number of ways would be the shortest possible? Even worse, if there is a "ready made" character for it, how would I know? And what about string lookup?

Afaik, all possible forms are correct, but there exist 4 normalized forms, one of which always uses the single characters, if available, and another of which always uses expanded char sequences, if available (I don't remember how the other two differ).

Note that they're the same thing only when properly processed and displayed; in a programming language, I'd say the string is in whatever form the user wrote it in the source and there is actually no choice to be made by the compiler (and/or spec).
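
A rough sketch of what that means in practice (the escapes below are just a hypothetical example of one character, 'é', that has both spellings; no normalization library is assumed):

import std.stdio;

void main()
{
    char[] precomposed = "\u00E9";   // 'é' as one code point  -> 2 UTF-8 bytes
    char[] decomposed  = "e\u0301";  // 'e' + combining acute  -> 3 UTF-8 bytes

    writefln(precomposed.length);        // 2
    writefln(decomposed.length);         // 3
    writefln(precomposed == decomposed); // not equal, though both display as 'é'
}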

> $ man utf-8
> 
> on my machine says Unicode 3.1 compatible programs are _required_ to not accept the non-shortest forms -- for security reasons!

I think you misread that - the security issue applies to encoding characters in more UTF-8 bytes than necessary (for example, encoding a space character in 4 bytes instead of 1). Rejecting such forms allows string comparisons using simple byte scanning, instead of always having to decode into characters.
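
A contrived sketch of the problem (no conforming encoder would ever produce the second sequence, which is exactly the point):

import std.stdio;

bool containsSlash(ubyte[] s)
{
    // naive byte-level filter: only looks for the shortest-form '/' (0x2F)
    foreach (b; s)
        if (b == 0x2F)
            return true;
    return false;
}

void main()
{
    ubyte[] shortest = [0x2F];        // '/' in conforming, shortest-form UTF-8
    ubyte[] overlong = [0xC0, 0xAF];  // the same '/' smuggled in as a non-shortest form

    writefln(containsSlash(shortest)); // caught
    writefln(containsSlash(overlong)); // slips right past the filter
}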


xs0
November 15, 2005
xs0 wrote:
> Georg Wrede wrote:
> 
>> Here's one you'd love: Guess what the following prints:
>>
>>     char[] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
>>     auto len = baz.length;
>>     writefln(len);
>>
>> Another one: did you know that it is impossible on some machines to
>> select that line from the beginning up until the middle of the Arabic
>> string?
> 
> From what Thunderbird does, I'd say that's because it's actually written right-to-left :)
> 
>> Likewise, on some others, copying and pasting that string results in it
>> getting backward.
> 
> Ditto..

I know. Makes creating text editors a bit harder.

>>> I believe there is a rule that the shortest possible one is deemed 'correct' for purposes like ours shown above, but if you obtain the strings from a file, say, you may encounter the same 'string' represented in UTF-8 but find that the char[]s are different lengths.
>>
>> Aarrghh, and how am I supposed to figure out which of a number of ways would be the shortest possible? Even worse, if there is a "ready made" character for it, how would I know? And what about string lookup?
> 
> Afaik, all possible forms are correct, but there exist 4 normalized forms, one of which always uses the single characters, if available, and another of which always uses expanded char sequences, if available (I don't remember how the other two differ).

I'd probably have to have a huge table, or maybe there is an algorithm that tells me whether a particular form is non-shortest.

>> my machine says Unicode 3.1 compatible programs are _required_ to not accept the non-shortest forms -- for security reasons!
> 
> 
> I think you misread that - the security issue applies to encoding characters in more UTF-8 bytes than necessary (for example, encoding a space character in 4 bytes instead of 1). Rejecting such forms allows string comparisons using simple byte scanning, instead of always having to decode into characters.

Read for yourself. The following is pasted from the output of "man utf-8" on Fedora Core 4.


SECURITY

The Unicode and UCS standards require that producers of UTF-8 shall use
the shortest form possible, e.g., producing a two-byte sequence with
first byte 0xc0 is non-conforming. Unicode 3.1 has added the requirement
that conforming programs must not accept non-shortest forms in their
input. This is for security reasons: if user input is checked for
possible security violations, a program might check only for the ASCII
version of "/../" or ";" or NUL and overlook that there are many
non-ASCII ways to represent these things in a non-shortest UTF-8 encoding.
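
So the check is mechanical after all: once a decoder has read an n-byte sequence, it only has to verify that the resulting code point actually needed n bytes. Roughly, as a sketch:

bool isShortestForm(dchar c, uint bytesUsed)
{
    // a code point was overlong-encoded if it would have fit in fewer bytes
    if (bytesUsed == 1) return c <= 0x7F;
    if (bytesUsed == 2) return c >= 0x80 && c <= 0x7FF;
    if (bytesUsed == 3) return c >= 0x800 && c <= 0xFFFF;
    if (bytesUsed == 4) return c >= 0x10000;
    return false;
}

(If I remember right, std.utf.decode already throws on such sequences, so in practice one shouldn't have to write this by hand.)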
November 15, 2005
kris wrote:
> 
> I had wondered idly whether the pragma/whatever might be set within the class/struct/interface mechanism? Kind of like an attribute? The notion being that the class could indicate what its intentions were, regarding literal behavior (both elemental and [] varieties) for its contained methods/templates/mixins etc ~ as intended by the developer.

All this just to avoid a one-character suffix on string literals?  :-)


Sean
November 15, 2005
In article <4379E5B1.5090505@nospam.org>, Georg Wrede says...
>Oskar Linde wrote:
>> In article <43790E73.1080704@nospam.org>, Georg Wrede says...
>>>> I think we've painted ourselves into a corner by calling the UTF-8 entity "char"!!
>>>
>>>
>>> Given:
>>>
>>>    char[10] foo;
>>>
>>> What is the storage capacity of foo?
>>>
>>> - is it 10 UTF-8 characters
>>> - is it 2.5 UTF-8 characters
>>> - "it depends"
>>> - something else
>>
>>
>>
>> It is obviously 10 UTF-8 code units. 10 bytes of memory get reserved. Nothing else makes sense. There is no such thing as a UTF-8 character. There are Unicode characters, each of which is encoded as 1-4 UTF-8 code units (bytes).

>According to D docs there is!

Then the D doc is wrong. :)

>Which is yet another reason to figure out how to redo all of this UTF-related stuff in D.

What is fundamentally broken? (On a language level, not Phobos)

Sure, type-less string literals whose type gets inferred from usage are
a hack that is inconsistent with other parts of the language, but this can
be fixed by another (lesser?) hack:
Giving them a well-defined type (static char[length], for instance) that
is then allowed to be implicitly converted (with transcoding) to wchar[] and
dchar[]. (This transcoding could then be done at compile time by the compiler.)

Should re-encodings be done by implicit type conversion between the different
{,d,w}char[] types? I don't think so. Implicit type conversions are generally
considered evil, especially conversions that take a non-trivial amount
of time and allocate memory. It is easy enough to explicitly append a
toUTF16() where conversion is needed.
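
For example (a trivial sketch; takesWide() is just a made-up function that happens to want UTF-16):

import std.stdio;
import std.utf;

void takesWide(wchar[] s)
{
    writefln(s.length);
}

void main()
{
    char[] s = "Äiti syö lettuja.";
    takesWide(toUTF16(s));  // the conversion is explicit and visible at the call site
}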

Where more power is needed, a string class is the obvious choice. Such a class could cache transcodings, handle locale-dependent comparisons and manipulations, etc. Nothing in D prevents such a class.
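
Something along these lines, as a very rough sketch of the caching idea (names made up, nothing more than an illustration):

import std.utf;

class String
{
    private char[]  utf8;
    private wchar[] utf16;  // filled in lazily
    private dchar[] utf32;  // filled in lazily

    this(char[] s)
    {
        utf8 = s;
    }

    char[] toUTF8()
    {
        return utf8;
    }

    wchar[] toUTF16()
    {
        if (utf16 is null)
            utf16 = std.utf.toUTF16(utf8);  // transcode once, cache the result
        return utf16;
    }

    dchar[] toUTF32()
    {
        if (utf32 is null)
            utf32 = std.utf.toUTF32(utf8);
        return utf32;
    }

    // locale-dependent comparison, searching etc. could be added here
}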

>Oskar Linde wrote:
>> Unicode support is hardly mature everywhere yet... :)
>
>You can say that again! All the replies to the original Arabic string post had the string shown correctly, except yours!   ;-)
>
>What machine/OS/program did you write this with?

I used the http-nntp-gateway on www.digitalmars.com from my computer at work
(WinXP/Firefox).

>Yours didn't even handle the Finnish string!

I blame it on the web-interface :). But I've grown quite proficient at reading
some misinterpreted UTF-8 as Latin-1, so here it is in ISO-8859-1 (I guess):
"Äiti syö lettuja."
The web-interface seems to disregard any notion of charsets/encodings.

Regards

Oskar


November 15, 2005
kris wrote:
> Walter Bright wrote:
>> I understand the issue. The best I could come up with is the c, w and d
>> suffixes in the cases where the type cannot be inferred.
>>
>> Other ways to deal with it:
>>
>> 1) instead of overloading the functions by parameter type, change the names
>> as in:
>>
>>     putc(char[])
>>     putw(wchar[])
>>     putd(dchar[])
> 
> Doesn't work for opX aliases, such as opShl() for C++ compatibility :-(

Ack!  Perhaps this is a sign that C++ style i/o isn't particularly suited for D.  I'll admit I have a bit of a love/hate relationship with that syntax anyway--the only reason I tend to use stream-based i/o over printf in C++ is for type safety and UDT support, and we get both for free with writef in D.  Supporting the syntax has merit for code portability reasons, but possibly not as the primary native i/o syntax.

>> 2) don't provide the overload, and rely on transcoding
> 
> That's perhaps fine for single char instances, but it won't fly for anything other than trivial arrays :-(

It's slow perhaps, but are there any other arguments against it?


Sean
November 15, 2005
"kris" <fu@bar.org> wrote in message news:dlc9iv$13n4$1@digitaldaemon.com...
> Doesn't work for opX aliases, such as opShl() for C++ compatibility :-(

That's true, though I'd argue that emulating C++ iostreams' use of << and >> needs to die <g>.


November 15, 2005
"Georg Wrede" <georg.wrede@nospam.org> wrote in message news:43779E34.6070205@nospam.org...
> Then I tried to compile each of the files. UTF-7 produced a C-like slew of errors, and upon looking at the file with "less" I found out it was full of extra crap. (The file itself was OK, but UTF-7 is not for us.)

Right, UTF-7 is not supported by the compiler.


November 15, 2005
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsz7ar0ol23k2f5@nrage.netwin.co.nz...
> This is what was confusing me. I would have expected the line above to print in UTF-32. The only explanation I can think of is that the output stream is converting to UTF-8. In fact, I find it quite likely.
>
> If you want to write a UTF-32 file you can do so by using "write" as opposed to "writeString" on a dchar array, I believe.

The UTF encoding of the output of writef is determined by the "orientation" of the stdout stream (set/read by std.c.fwide).


November 15, 2005
"Sean Kelly" <sean@f4.ca> wrote ...
> kris wrote:
>> That's perhaps fine for single char instances, but it won't fly for anything other than trivial arrays :-(
>
> It's slow perhaps, but are there any other arguments against it?

'Slow' should be more than enough argument where raw IO is concerned <g>.

Put it this way: by supporting only

write (dchar[])

then you avoid the need for literal casting or postfixing, and the literals will all be stored (at compile time, one would hope) as dchar arrays. That is, one would expect the suffix to actually dictate a storage-class rather than a cast?

However, this also means that all non-static char[] and wchar[] instances will be converted "on the fly", regardless of how large they might be, or how they might be stored at the back-end. To 'enforce' that upon a design should be considered a criminal waste of bandwidth & horsepower <g> Transcoding is just not a practical answer, I'm afraid.
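
To sketch what that looks like (output() below is just a stand-in for a dchar[]-only IO routine, and std.utf is assumed for the transcoding):

import std.stdio;
import std.utf;

void output(dchar[] s)
{
    // pretend this is the only overload the IO layer provides
    writefln(s.length);
}

void main()
{
    char[] text = "a mostly-ASCII log line; imagine megabytes of these";

    // every call transcodes and copies: roughly 4 bytes per character instead of 1
    output(toUTF32(text));
}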

