November 18, 2005
A couple of issues with UTF
Haa, look ma, no hands!

So we have implicit UTF conversion.
The following compiles ok!

############### FC4 terminal log copy:

$ cat outtest2fixed.d
import std.stream;

void main()
{
        char[] c = "Saatana perkele"c;
        wchar[] w = "Saatana perkele"w;

        File of1 = new File;
        File of2 = new File;

        of1.create("/tmp/1f.txt");
        of2.create("/tmp/2f.txt");

        of1.write(c);
        of1.write(c);
        of1.write(w);
        of1.write(w);

        of2.write(w);
        of2.write(w);
        of2.write(c);
        of2.write(c);

        of1.close();
        of2.close();
}
 $ hexdump -C /tmp/1f.txt
00000000  0f 00 00 00 53 61 61 74  61 6e 61 20 70 65 72 6b
|....Saatana perk|
00000010  65 6c 65 0f 00 00 00 53  61 61 74 61 6e 61 20 70
|ele....Saatana p|
00000020  65 72 6b 65 6c 65 0f 00  00 00 53 00 61 00 61 00
|erkele....S.a.a.|
00000030  74 00 61 00 6e 00 61 00  20 00 70 00 65 00 72 00
|t.a.n.a. .p.e.r.|
00000040  6b 00 65 00 6c 00 65 00  0f 00 00 00 53 00 61 00 
|k.e.l.e.....S.a.|
00000050  61 00 74 00 61 00 6e 00  61 00 20 00 70 00 65 00
|a.t.a.n.a. .p.e.|
00000060  72 00 6b 00 65 00 6c 00  65 00
|r.k.e.l.e.|
 $ hexdump -C /tmp/2f.txt
00000000  0f 00 00 00 53 00 61 00  61 00 74 00 61 00 6e 00
|....S.a.a.t.a.n.|
00000010  61 00 20 00 70 00 65 00  72 00 6b 00 65 00 6c 00
|a. .p.e.r.k.e.l.|
00000020  65 00 0f 00 00 00 53 00  61 00 61 00 74 00 61 00
|e.....S.a.a.t.a.|
00000030  6e 00 61 00 20 00 70 00  65 00 72 00 6b 00 65 00
|n.a. .p.e.r.k.e.|
00000040  6c 00 65 00 0f 00 00 00  53 61 61 74 61 6e 61 20
|l.e.....Saatana |
00000050  70 65 72 6b 65 6c 65 0f  00 00 00 53 61 61 74 61
|perkele....Saata|
00000060  6e 61 20 70 65 72 6b 65  6c 65
|na perkele|
 $ cat /tmp/1f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
 $ cat /tmp/2f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
 $

########################### end of log copy.

HOWEVER, I have a couple of issues here.

First of all, it looks like we don't have implicit conversion, but 
rather that the strings get copied to the output stream byte by byte!

Now, the standard says that you are not allowed to have illegal octets 
or characters in a UTF file. Of any width.

Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!!

Further, it seems write puts the BOM before every string. That is 
definitely illegal.

(The operating system let me "cat" the files to screen, and tried its 
best to show them in a reasonable way (as you see above). But it really 
would not have had to.)

---

What we could have happen is that the first string output to the 
stream causes the stream to choose its UTF width (and theoretically 
the endianness, too). (This is what the OS does when choosing whether 
to open in byte width or wider, according to the Linux documentation.) 
And whenever somebody then tries to stuff "the wrong" kind of string 
there, do either of the following:

 - implicitly convert the string to the right UTF
 - throw error
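
To make the first option concrete, here is a rough sketch in D. The 
streamWidth property and rawWrite method are hypothetical, invented for 
the sketch; only std.utf's toUTF16/toUTF32 are real library calls:

```d
import std.utf;

// Sketch only: assume the stream remembers the UTF width chosen by the
// first write (8, 16 or 32), and every later string is converted to
// match (option one above) instead of being rejected (option two).
void writeText(Stream s, char[] text)
{
    switch (s.streamWidth)                           // hypothetical property
    {
        case 8:  s.rawWrite(cast(void[]) text);          break;
        case 16: s.rawWrite(cast(void[]) toUTF16(text)); break;
        case 32: s.rawWrite(cast(void[]) toUTF32(text)); break;
        default: throw new Exception("stream UTF width not set");
    }
}
```

The second option (throw) would simply replace the conversions with a 
type check that raises an error on mismatch.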

---

While D is in pre-1.0, I think we should for now decide that streams 
have to be opened with the UTF encoding specified. Since the compiler 
should know the type of all the strings (see my other post today), it 
can then insert code for the appropriate runtime conversion.

Since the compiler knows the type of the string, it might be suggested 
that the first output string define the stream type.

I think that would be unwise. But _only_ for the same reason that D 
demands a default case in a switch, forbids a semicolon right after an 
if clause, etc.

That is, to help the programmer not shoot himself in the foot. There is 
_no_ valid reason why it couldn't be set automatically by the first string.

HOWEVER, good table manners ask for reasonable defaults where at all 
possible. Such a default would be the UTF width and endianness that is 
"natural" on the particular platform.

(If D is ever ported to a platform that doesn't handle UTF, then the 
Natural Default of course is None. That is, one has to choose manually 
when opening the stream.)

---

Similarly, if we want to implement our INPUT streams correctly, then 
they should _definitively_ choose their UTF type before the first time 
the application gets to read from the stream.

FOR THE SITUATIONS where one has to process the first octet before 
enough of the stream has been seen to know which UTF type it is, THEN 
in THAT CASE an input stream of e.g. ubyte should be mandatory to use 
instead. Or, more to the point, UTF streams should not be used then.
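
The "choose before the first read" step could be sketched as plain BOM 
sniffing on the first few octets. This helper is hypothetical, not part 
of any library; the signature bytes themselves are the standard ones:

```d
// Decide the UTF width of an incoming byte stream by peeking at its
// first bytes (the BOM). Returns 8, 16 or 32; endianness detection is
// left out for brevity. Note the order of the checks: the UTF-32
// little-endian BOM (FF FE 00 00) begins with the UTF-16 one (FF FE),
// so the wider check must come first.
int detectUtfWidth(ubyte[] head)
{
    if (head.length >= 3 &&
        head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return 8;                                    // UTF-8 BOM
    if (head.length >= 4 &&
        ((head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00) ||
         (head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)))
        return 32;                                   // UTF-32 BOM
    if (head.length >= 2 &&
        ((head[0] == 0xFF && head[1] == 0xFE) ||
         (head[0] == 0xFE && head[1] == 0xFF)))
        return 16;                                   // UTF-16 BOM
    return 8;                                        // no BOM: assume UTF-8
}
```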

---

I have to remark on "since the compiler knows the type of string" above. 
Since this is such rocket science, DO REMEMBER that it "knows" because 
it looks at the TYPE (as in char[], wchar[], dchar[]) and not the 
CONTENTS of the string at that time.

:-) Just to keep apples and oranges in order...

---

Incidentally, what I called the BOM above does not look like a BOM 
should, in the above file dumps anyway.

---

Before we continue, I think everybody should read the following:

www.unicode.org/faq/

                            -- ** --
November 18, 2005
Re: A couple of issues with UTF
Georg Wrede wrote:
<snip>
>  $ cat /tmp/1f.txt
> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>  $ cat /tmp/2f.txt
> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>  $
> 
> ########################### end of log copy.
> 
> HOWEVER, I have a couple of issues here.

I think you have some serious issues with the political correctness of 
the message here ;)
November 18, 2005
Re: A couple of issues with UTF
Georg Wrede wrote:
<snip>
> HOWEVER, I have a couple of issues here.
> 
> First of all, it looks like we don't have implicit conversion, but 
> rather that the strings get copied to the output stream byte by byte!

write(...) writes the source value to the stream byte by byte.

> 
> Now, the standard says that you are not allowed to have illegal octets, 
> or characters in a UTF file. Of any width.
> 
> Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same 
> file!!!!!
> 
> Further, seems like write puts the BOM before every string. That is 
> definitely illegal.
> 

That is illegal if you're trying to create a valid _text_ file. AFAIK 
the normal File is just a regular OutputStream, it doesn't care about UTF.

> What we could have happen is, that the first string output to the 
> stream, causes the stream to choose the stream UTF width (and 
> theoretically the endianness, too). (This is what the OS does when 
> choosing whether to open in byte width or wider, according to linux 
> documentation.) And whenever somebody tries to stuff "the wrong" crap 
> there, do either of the following:
> 
>  - implicitly convert the string to the right UTF
>  - throw error

I think this should not be the default for all streams. Maybe it would 
be better to have a new TextStream class that supports full Unicode?
November 18, 2005
Re: A couple of issues with UTF
Jari-Matti Mäkelä wrote:
> Georg Wrede wrote:
> <snip>
> 
>>  $ cat /tmp/1f.txt
>> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>>  $ cat /tmp/2f.txt
>> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>>  $
>>
>> ########################### end of log copy.
>>
>> HOWEVER, I have a couple of issues here.
> 
> 
> I think you have some serious issues with the political 
> correctness of the message here ;)

ROFL !

I trust the non-Finns use inborn Duck Typing. If it doesn't look like 
obscenities, then it isn't.   :-)

Or maybe I have an encryptor that turns "Hail Mary" into that string.

Or maybe repeatedly drawing a picture out of iron wire has my hands 
bleeding, and I'm getting pissed off here. Maybe I should switch to clay 
models...

But hey, it was USASCII all over!
November 18, 2005
Re: A couple of issues with UTF
Jari-Matti Mäkelä wrote:
> Georg Wrede wrote:
> <snip>
> 
>> HOWEVER, I have a couple of issues here.
>>
>> First of all, it looks like we don't have implicit conversion, but 
>> rather that the strings get copied to the output stream byte by byte!
> 
> 
> write(...) writes the source value to the stream byte by byte.

Oops.

Well, in that case, we should give it uchar[] when we don't want 
fanciness. Or void[], right!

Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR 
we should have different kinds of streams. Some of which would be UTF 
savvy, some text, some void streams.
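
For what it's worth, std.stream seems to already offer both behaviours 
side by side, if I read the API right: write(char[]) serializes the 
array with its length prefix (the "0f 00 00 00" seen in the hexdumps 
above), while writeString emits only the raw characters:

```d
import std.stream;

void main()
{
    File f = new File;
    f.create("/tmp/3f.txt");

    // write() serializes the array: a length prefix, then the bytes.
    f.write("Saatana perkele"c);

    // writeString() emits only the raw characters, no prefix --
    // what one would want for a plain text file.
    f.writeString("Saatana perkele"c);

    f.close();
}
```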

>> Now, the standard says that you are not allowed to have illegal 
>> octets, or characters in a UTF file. Of any width.
>>
>> Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same 
>> file!!!!!
>>
>> Further, seems like write puts the BOM before every string. That is 
>> definitely illegal.
> 
> That is illegal if you're trying to create a valid _text_ file. AFAIK
> the normal File is just a regular OutputStream, it doesn't care about
> UTF.

We should have a set of different streams. Hey, Java has like millions 
to choose from! You can even join them to get, say, a "buffered, 
character-code-translating, rot-13, foo-izing" stream!!!

>> What we could have happen is, that the first string output to the 
>> stream, causes the stream to choose the stream UTF width (and 
>> theoretically the endianness, too). (This is what the OS does when 
>> choosing whether to open in byte width or wider, according to linux 
>> documentation.) And whenever somebody tries to stuff "the wrong" crap 
>> there, do either of the following:
>>
>>  - implicitly convert the string to the right UTF
>>  - throw error
> 
> I think this should not be the default for all streams. Maybe it would 
> be better to have a new TextStream class that supports full Unicode?

Of course!
November 18, 2005
Re: A couple of issues with UTF
Georg Wrede wrote:
> Further, seems like write puts the BOM before every string. That is 
> definitely illegal.
> 
...
> 
> What I called BOM above, does incidentally not look like it should, in 
> the above file dumps anyway.
> 
Because that's not the BOM, it's (an int with) the string length...

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to 
be... unnatural."
November 23, 2005
Re: A couple of issues with UTF
Georg Wrede wrote:
> Jari-Matti Mäkelä wrote:
>> Georg Wrede wrote: <snip>
>> 
>>> HOWEVER, I have a couple of issues here.
>>> 
>>> First of all, it looks like we don't have implicit conversion,
>>> but rather that the strings get copied to the output stream byte
>>> by byte!
>> 
>> write(...) writes the source value to the stream byte by byte.
> 
> Oops.
> 
> Well, in that case, we should give it uchar[] when we don't want 
> fanciness. Or void[], right!
> 
> Which should make it EITHER illegal to write [c/w/d]char[] to it --
> OR we should have different kinds of streams. Some of which would be
> UTF savvy, some text, some void streams.
> 
>>> Now, the standard says that you are not allowed to have illegal 
>>> octets, or characters in a UTF file. Of any width.
>>> 
>>> Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the
>>> same file!!!!!
>>> 
>>> Further, seems like write puts the BOM before every string. That
>>> is definitely illegal.
>> 
>> That is illegal if you're trying to create a valid _text_ file.
>> AFAIK the normal File is just a regular OutputStream, it doesn't
>> care about UTF.
> 
> We should have a set of different streams. Hey, Java has like
> millions to choose from! You can even join them to get, say, a
> "buffered, character-code-translating, rot-13, foo-izing" stream!!!

At first the Java style, where one chains streams, seemed terribly 
inefficient. But later I understood that it wasn't; it just looked 
inefficient. We could have raw input and output streams, and then a set 
of conversion streams (actually filters), like this:

OutStream os = new OutStream("foo");  // opens a raw outstream
StreamBuffer sb = new StreamBuffer(os);
ConvStream cs = new ConvStream("UTF-8", "ISO-8859-15", sb);
...
char[] mytext = "kjsldkfjlskdfjslkd";
fwritefln(cs, mytext);

Since StreamBuffer eventually outputs everything, one doesn't even have 
to worry about the buffer filling up in "mid-char" when doing UTF 
output (not the case in the example above), since the rest of the 
character gets output later anyhow.

I think this looks clean and easy to maintain (for the library 
maintainer), and its use is straightforward, flexible, and conceptually 
clear.

This would also bring tighter locality to the whole input/output system, 
since every stream only does its own thing.

With this setup it also becomes much easier for the programmer to write 
his own stream filters, without having to become a Stream Guru first.
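
To illustrate that last point, a user-written filter under this design 
might look roughly like the following. The ByteSink interface is 
invented for the sketch; only std.utf.toUTF16 is a real library call:

```d
import std.utf;

// Minimal stand-in for the wrapped-stream interface of the chaining
// sketch above.
interface ByteSink
{
    void put(void[] data);
}

// A filter that treats incoming bytes as UTF-8 text and passes them on
// transcoded to UTF-16. It neither knows nor cares what the next link
// in the chain does -- buffering, file output, or another filter.
class Utf16Filter : ByteSink
{
    private ByteSink next;

    this(ByteSink next) { this.next = next; }

    void put(void[] data)
    {
        wchar[] w = toUTF16(cast(char[]) data);
        next.put(cast(void[]) w);
    }
}
```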