November 23, 2005
Oskar Linde wrote:
> Georg Wrede wrote:
> 
>> Regan Heath wrote:
>>
>>> On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
>>>
>>>> A UTF-8 code unit can not be stored in char. Therefore UTF can't be  mentioned at all in this context.
>>>
>>> I suspect we're having terminology issues again.
>>>
>>> Regan
>>
>> Thanks! Of course!
>>
>> So "A UTF-8 code point".
> 
> No... Apparently a code point is a point in the Unicode space, i.e. a character. The guys that came up with that should be shot, though. ;)
> 
> http://www.unicode.org/glossary/
> 
> There are no UTF-8 code points by terminology. Only
> 
> UTF-8 code unit == UTF-8 code value
> Unicode code point == Unicode character == Unicode symbol
> 
> :)
> 
> /Oskar

Aaaarrrghhhhh!  :-P
November 24, 2005
Georg Wrede wrote:

<snip>

> (You might take an average program, count the literal strings, and then time the execution. Figure out the stupidest choice between UTF[8..32] and the OS's native encoding, and then calculate the time for decoding. I'd say that this amounts to such peanuts that it really is not worth thinking about.)
> 
> If a programmer writes a 10kloc program, half of which is string literals, then maybe yes, it could make a measurable difference. OTOH, such a programmer's code probably has bigger performance issues (and others) anyhow.
>
The complexity of UTF-to-UTF conversions should be linear. I don't believe anyone wants to convert 200 kB strings while drawing a 3D scene, etc.
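
For what it's worth, such a conversion with Phobos' std.utf is just a single pass over the string. A minimal sketch (assuming the toUTF16/toUTF32 helpers in std.utf; the sample text is arbitrary):

import std.stdio;
import std.utf;    // toUTF16, toUTF32

void main() {
  char[] s = "tässä on vähän tekstiä";   // sample UTF-8 text
  // Each conversion walks the input exactly once, so the cost is
  // linear in the length of the string.
  wchar[] w = toUTF16(s);
  dchar[] d = toUTF32(s);
  writefln("%d UTF-8 code units, %d UTF-16, %d UTF-32",
           s.length, w.length, d.length);
}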

IMHO this UTF-thing is a problem since some people here don't want to write their own libraries and would like the compiler to do everything for them.

I totally agree that we should have only one type of string structure. I write Unicode-compliant programs every week and have never used the wchar/dchar types. What I would really like is a Unicode stream class that could read/write valid Unicode text files.

>> The problem lies in the "" form, where you have not specified the encoding. The compiler will try to infer the encoding depending on context. This is a weakness I think should be fixed by:
>>
>> - "" is char[], but may be implicitly cast to dchar[] and wchar[]
> 
> 
> That would fix it technically. I'd vote yes.
> 
> The bigger thing to fix is the rest. (char not being code point, gotchas when having arrays of char which you then slice, etc. -- still have to be sorted out.)

I don't believe this is a big problem. This works correctly already:

foreach(wchar c; char[] unicodestring) { ... }

Walter should fix this so that the following would work too:

foreach(char c; char[] unicodestring) { ... }
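
Spelled out as a complete little program, here is the case that already works (just a sketch; "åäö" is an arbitrary sample string):

import std.stdio;

void main() {
  char[] unicodestring = "åäö";   // 3 characters, 6 UTF-8 code units

  // With a wchar or dchar loop variable, foreach decodes the UTF-8
  // code units on the fly, one Unicode character per iteration.
  foreach (dchar c; unicodestring)
    writefln(c);
}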

BTW, should the sorting of char-arrays be a language or library feature?

>> What is the wrong with a well defined representation?
> 
> 
> First of all, if the machine is big endian or little endian. What if it wants to store them in an unexpected place? Maybe it's a CIA certified compiler that encrypts all literals in binaries and inserts on-the-fly runtime decoding routines in all string fetches? (I REALLY wouldn't be surprised to hear they have one.)
> 
Now wasn't that already the second conspiracy theory you have posted today... go get some sleep, man :)

> Having said that, if somebody asks me, I'd vote for UTF-8 on Linux, and whatever is appropriate on Windows. Then I could look for the strings in the binary file with the unix "strings" command.

UTF-8 on Linux. Maybe UTF-16 on Windows (for performance)? OTOH, UTF-8 has serious advantages over UTF-7/16/32 [1], [2] (some of us appreciate other things than just raw performance).

[1] http://www.everything2.com/index.pl?node=UTF-8
[2] http://en.wikipedia.org/wiki/Utf-8
November 24, 2005
Oskar Linde wrote:
> Regan Heath wrote:
>> On Wed, 23 Nov 2005 13:18:49 -0800, Kris <fu@bar.com> wrote:
>>> "Oskar Linde" <oskar.lindeREM@OVEgmail.com> wrote
>>>> Kris wrote:
>>>> 
>>>>> There's the additional question as to whether the literal
>>>>> content should be used to imply the type (like literal chars
>>>>> and numerics do), but that's a different topic and probably
>>>>> not as important.
>>>> 
>>>> How would the content imply the type? All Unicode strings are representable equally by char[], wchar[] and dchar[]. Do you
>>>> mean that the most optimal (memory wise) encoding is used? Or
>>>> do you mean that file encoding could imply type? (This means
>>>> that transcoding the source code could change program behaviour.)
>>> 
>>> I meant in terms of implying the "default" type. We're suggesting
>>> that the default be char[], but if the literal contained Unicode
>>> chars then the default might be something else. This is the
>>> method used to assign a type to a char literal, but Walter has
>>> noted it might be confusing to do something similar with string
>>> literals.
> 
> I think I see what you mean. By Unicode char, you actually mean a
> Unicode character with code point > 127 (i.e. not representable in
> ASCII).

I'd be willing to say (see my other post in this thread, a couple of hours ago) that the difference in execution time from having the string literal decoded from a "suboptimal" encoding to the needed type is negligible.

And deciding based on content just makes things complicated: the deciding, the storing, and the retrieving.

From that it follows that the time for Walter to write the code for that is wasted. Much better to just store the literal in some type (whatever we, or actually Walter, decide) and be done with it. At usage time it'd then get cast as needed, if needed. Simple.

>> Did you see my reply to you in this other thread .. wait a minute
>> .. where's it gone .. no wonder you didn't reply, it seems my post
>> was never  made. Let me try again: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/2216
>> 
>> What do you two think of what I have there?
>> 
>>> As indicated, I don't think this aspect is of much importance compared to the issues associated with an uncommitted literal type
>>> ~ that uncommitted aspect needs to be fixed, IMO.
>>
>> I agree. While there exists a slight risk in doing so, it's no different to the risk involved with the current handling of
>> integer literals. This change would essentially make string
>> literal handling consistent with integer literal handling.
> 
> Reasoning: The only risk in always defaulting to char[] is
> efficiency? Like the following scenario:
> 
> A user writes string literals, in say Chinese, without a suffix,
> making those strings char[] by default. Those literals are fed to
> functions having multiple implementations, optimised for char[],
> wchar[] and dchar[]. Since the optimal encoding for Chinese is
> UTF-16 (guessing here), the user's code will be suboptimal.

Ok, let's have a vote:

Let's say we have 100 000 French names, and we want to sort them. We have the same names in a char[][], wchar[][] and a dchar[][].

Which is fastest to sort?

What if we have 100 000 Chinese names? Which is fastest?

What if we have 100 000 English names?


(Of course somebody will actually code and run this. But I'd like to hear it from folks _before_ that.)
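
For whoever eventually runs it, something along these lines should do. A rough, untested sketch: the generated names are just placeholders, and it assumes the built-in array .sort, std.utf's toUTF16/toUTF32 and std.date's getUTCtime for timing:

import std.stdio;
import std.string;   // toString
import std.utf;      // toUTF16, toUTF32
import std.date;     // getUTCtime, TicksPerSecond

void main() {
  // Placeholder data; a real test would load 100 000 actual names.
  char[][] names8;
  for (int i = 0; i < 100000; i++)
    names8 ~= "Éléonore Dupont " ~ std.string.toString(i);

  // The same names transcoded to UTF-16 and UTF-32.
  wchar[][] names16;
  dchar[][] names32;
  foreach (char[] s; names8) {
    names16 ~= toUTF16(s);
    names32 ~= toUTF32(s);
  }

  d_time t0 = getUTCtime();
  names8.sort;                   // built-in lexicographic array sort
  d_time t1 = getUTCtime();
  names16.sort;
  d_time t2 = getUTCtime();
  names32.sort;
  d_time t3 = getUTCtime();

  writefln("char[][] : %d ms", (t1 - t0) * 1000 / TicksPerSecond);
  writefln("wchar[][]: %d ms", (t2 - t1) * 1000 / TicksPerSecond);
  writefln("dchar[][]: %d ms", (t3 - t2) * 1000 / TicksPerSecond);
}

(Whether the wider element types pay off in comparison speed or lose on memory traffic is exactly the question.)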

November 24, 2005
Jari-Matti Mäkelä wrote:
> 
> Yes, that might be true if he's using Windows. On Linux the compiler input must be valid Unicode. Maybe it's hard to create Unicode-compliant 

The compiler input must also be valid Unicode on Windows. Just try to feed it with some invalid characters and you'll see how it'll complain.

> programs on Windows since even Sun Java has some problems with Unicode class names on Windows XP.
> 
> -- 
> 
> My point here was that since char[] is a fully valid UTF-8 string and the index/slice-operators are intelligent enough to work on the Unicode-symbol level [code 1], we should be able to store Unicode symbols to a char-variable as well. You see, an UTF-8 symbol requires 8-32 bits of storage space, so it would be perfectly possible to implement an UTF-8 symbol (char) using standard 32-bit integers or 1-4 x 8-bit bytes.
> 
> The current implementation seems to be always using 1 x 8-bit byte for the char-type and n x 8-bit bytes for char[] - strings. The strings work well but it's impossible to store a single UTF-8 character now.
> 
> [code 1]:
> 
>   char[] a = "∇∆∈∋";
>   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)


-- 
Carlos Santander Bernal
November 24, 2005
Jari-Matti Mäkelä wrote:
> 
> My point here was that since char[] is a fully valid UTF-8 string and the index/slice-operators are intelligent enough to work on the Unicode-symbol level [code 1], we should be able to store Unicode symbols to a char-variable as well. 

No, this is wrong. The index/slice operations have no idea about Unicode/UTF.

> You see, an UTF-8 symbol requires 8-32 bits of storage space, so it would be perfectly possible to implement an UTF-8 symbol (char) using standard 32-bit integers or 1-4 x 8-bit bytes.
> 
> The current implementation seems to be always using 1 x 8-bit byte for the char-type and n x 8-bit bytes for char[] - strings. The strings work well but it's impossible to store a single UTF-8 character now.

There is no such thing as a "UTF-8 character" or "UTF-8 symbol"... only a "Unicode character" or a "UTF-8 code unit".

UTF is merely the encoding. Unicode is the character set.

> [code 1]:
> 
>   char[] a = "∇∆∈∋";
>   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)

You really had me confused there for a moment (I had to test this). Fortunately you are wrong. On what platform and with what source code encoding did you test this? The following is on DMD 0.139 on Linux:

import std.stdio;

void main() {
  char[] a = "åäö";
  writef(a[1]);
}

Will print Error: 4invalid UTF-8 sequence

Which is correct. (The error message might be slightly confusing, though.)

import std.stdio;

void main() {
  char[] a = "åäö";
  writef(a[2..4]);
}

will print "ä"
Which is correct.

/Oskar
November 24, 2005
Oskar Linde wrote:
> Jari-Matti Mäkelä wrote:
> 
>>
>> My point here was that since char[] is a fully valid UTF-8 string and the index/slice-operators are intelligent enough to work on the Unicode-symbol level [code 1], we should be able to store Unicode symbols to a char-variable as well. 
> 
> 
> No, this is wrong. The index/slice operations have no idea about Unicode/UTF.
> 
Sorry, I didn't test this enough. At least the foreach statement is Unicode-aware. Anyway, these operations and the char type should know about Unicode.

>> You see, an UTF-8 symbol requires 8-32 bits of storage space, so it would be perfectly possible to implement an UTF-8 symbol (char) using standard 32-bit integers or 1-4 x 8-bit bytes.
>>
>> The current implementation seems to be always using 1 x 8-bit byte for the char-type and n x 8-bit bytes for char[] - strings. The strings work well but it's impossible to store a single UTF-8 character now.
> 
> 
> There is no such thing as a "UTF-8 character" or "UTF-8 symbol"... only a "Unicode character" or a "UTF-8 code unit".

Ok, "UTF-8 symbol" -> "the UTF-8 encoded bytestream of an Unicode symbol"

>> [code 1]:
>>
>>   char[] a = "∇∆∈∋";
>>   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)
> 
> 
> You really had me confused there for a moment (I had to test this). Fortunately you are wrong. On what platform with what source code encoding have you tested this? The following is on DMD 0.139 linux:

Sorry again, didn't test this. This ought to work, but it doesn't.

> import std.stdio;
> 
> void main() {
>   char[] a = "åäö";
>   writef(a[2..4]);
> }
> 
> will print "ä"
> Which is correct.

Actually it isn't correct behavior. There's no real need to change the string at the byte level. Think about it: you don't have to change individual bits in an ASCII string either. In 7-bit ASCII the base unit is a 7-bit byte, in ISO-8859-x it is an 8-bit byte, and in UTF-8 it is 8-32 bits. The smallest unit you need is the base unit.
November 24, 2005
Jari-Matti Mäkelä wrote:
> Oskar Linde wrote:
> 
>> There is no such thing as a "UTF-8 character" or "UTF-8 symbol"... only a "Unicode character" or a "UTF-8 code unit".
> 
> Ok, "UTF-8 symbol" -> "the UTF-8 encoded bytestream of an Unicode symbol"

According to terminology, it is "UTF-8 code unit" or "UTF-8 code value".

>>> [code 1]:
>>>
>>>   char[] a = "∇∆∈∋";
>>>   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)
>>
>>
>>
>> You really had me confused there for a moment (I had to test this). Fortunately you are wrong. On what platform with what source code encoding have you tested this? The following is on DMD 0.139 linux:
> 
> Sorry again, didn't test this. This ought to work, but it doesn't.
> 
>> import std.stdio;
>>
>> void main() {
>>   char[] a = "åäö";
>>   writef(a[2..4]);
>> }
>>
>> will print "ä"
>> Which is correct.
> 
> 
> Actually it isn't correct behavior. There's no real need to change the string at the byte level. Think about it: you don't have to change individual bits in an ASCII string either. In 7-bit ASCII the base unit is a 7-bit byte, in ISO-8859-x it is an 8-bit byte, and in UTF-8 it is 8-32 bits. The smallest unit you need is the base unit.

It's rather a matter of efficiency. Character indexing on a UTF-8 array is O(n) (you can of course make it better, but with more complexity and higher memory requirements), while code unit indexing is O(1).

You actually very seldom need character indexing. Almost everything can be done on a code unit level.
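
A small illustration of the difference (a sketch; toUTFindex and decode are the std.utf helpers I have in mind, and "åäö" is just sample data):

import std.stdio;
import std.utf;    // toUTFindex, decode

void main() {
  char[] s = "åäö";              // 3 characters, 6 code units

  // O(1): plain code unit indexing. This yields a single byte of a
  // multi-byte character, not a whole character.
  writefln("code unit 2: 0x%02X", cast(uint) s[2]);

  // O(n): character indexing has to scan the string from the start.
  size_t i = toUTFindex(s, 2);   // code unit index of the 3rd character
  dchar c = decode(s, i);        // decode also advances i past it
  writefln("character 2: %s", c);
}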

/Oskar
November 24, 2005
Oskar Linde wrote:
> 
> void main() {
>   char[] a = "åäö";
>   writef(a[2..4]);
> }
> 
> will print "ä"
> Which is correct.

Hmmmmmmmmm. "Correct" is currently under intense debate here.
November 24, 2005
Georg Wrede wrote:
> Oskar Linde wrote:
> 
>>
>> void main() {
>>   char[] a = "åäö";
>>   writef(a[2..4]);
>> }
>>
>> will print "ä"
>> Which is correct.
> 
> 
> Hmmmmmmmmm. "Correct" is currently under intense debate here.

Yes, the biggest problem here is that some people don't like the O(n) complexity of the 'correct' UTF-8 indexing.

I'm not a string handling expert but I don't like the current implementation:

char[] a = "åäö";
writefln(a.length);	// outputs: 6


char[] a = "tässä  tekstiä";
std.string.insert(a, 6, "vähän");
writefln(a);		// outputs: tässä  tekstiä
November 24, 2005
Jari-Matti Mäkelä wrote:
> Yes, the biggest problem here is that some people don't like the O(n) complexity of the 'correct' UTF-8 indexing.

Of course O(n) is worse than O(1). Then again, that may not be such a problem in actual applications, since the total time spent on this is usually (almost) negligible compared to the total runtime activity of the application.

And if beta testing shows that the program is not "fast enough" with production data, then it is easy to profile a D program. If it turns out that precisely this causes the slowness, then one might use, say, UTF-16 in those situations. No biggie. Or UTF-32.

It's actually a shame that we haven't tried all of this stuff already. The wc.d example might be a good candidate to try the three UTF encodings and time the results. And other examples would be quite easy to write.

(Ha, a chance for newbies to become "D gurus": publish your results here!)
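
To get people started, the measured inner loop of a wc-style test could look something like this for two of the encodings (a sketch; the whitespace test is deliberately simplistic, and timing scaffolding would go around the calls):

import std.stdio;
import std.utf;    // toUTF32

// Simplistic word count over UTF-8 code units. Spaces, tabs and
// newlines are single code units, so no decoding is needed here.
int countWords8(char[] text) {
  int words = 0;
  bool inWord = false;
  foreach (char c; text) {
    bool space = (c == ' ' || c == '\t' || c == '\n');
    if (!space && !inWord)
      words++;
    inWord = !space;
  }
  return words;
}

// The dchar[] version is the very same loop; only the types change.
int countWords32(dchar[] text) {
  int words = 0;
  bool inWord = false;
  foreach (dchar c; text) {
    bool space = (c == ' ' || c == '\t' || c == '\n');
    if (!space && !inWord)
      words++;
    inWord = !space;
  }
  return words;
}

void main() {
  char[] s = "tässä on vähän tekstiä";
  writefln("char[] : %d words", countWords8(s));
  writefln("dchar[]: %d words", countWords32(toUTF32(s)));
}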