A couple of issues with UTF
November 18, 2005
Haa, look ma, no hands!

So we have implicit UTF conversion.
The following compiles ok!

############### FC4 terminal log copy:

$ cat outtest2fixed.d
import std.stream;

void main()
{
        char[] c = "Saatana perkele"c;
        wchar[] w = "Saatana perkele"w;

        File of1 = new File;
        File of2 = new File;

        of1.create("/tmp/1f.txt");
        of2.create("/tmp/2f.txt");

        of1.write(c);
        of1.write(c);
        of1.write(w);
        of1.write(w);

        of2.write(w);
        of2.write(w);
        of2.write(c);
        of2.write(c);

        of1.close();
        of2.close();
}
 $ hexdump -C /tmp/1f.txt
00000000  0f 00 00 00 53 61 61 74  61 6e 61 20 70 65 72 6b  |....Saatana perk|
00000010  65 6c 65 0f 00 00 00 53  61 61 74 61 6e 61 20 70  |ele....Saatana p|
00000020  65 72 6b 65 6c 65 0f 00  00 00 53 00 61 00 61 00  |erkele....S.a.a.|
00000030  74 00 61 00 6e 00 61 00  20 00 70 00 65 00 72 00  |t.a.n.a. .p.e.r.|
00000040  6b 00 65 00 6c 00 65 00  0f 00 00 00 53 00 61 00  |k.e.l.e.....S.a.|
00000050  61 00 74 00 61 00 6e 00  61 00 20 00 70 00 65 00  |a.t.a.n.a. .p.e.|
00000060  72 00 6b 00 65 00 6c 00  65 00                    |r.k.e.l.e.|
 $ hexdump -C /tmp/2f.txt
00000000  0f 00 00 00 53 00 61 00  61 00 74 00 61 00 6e 00  |....S.a.a.t.a.n.|
00000010  61 00 20 00 70 00 65 00  72 00 6b 00 65 00 6c 00  |a. .p.e.r.k.e.l.|
00000020  65 00 0f 00 00 00 53 00  61 00 61 00 74 00 61 00  |e.....S.a.a.t.a.|
00000030  6e 00 61 00 20 00 70 00  65 00 72 00 6b 00 65 00  |n.a. .p.e.r.k.e.|
00000040  6c 00 65 00 0f 00 00 00  53 61 61 74 61 6e 61 20  |l.e.....Saatana |
00000050  70 65 72 6b 65 6c 65 0f  00 00 00 53 61 61 74 61  |perkele....Saata|
00000060  6e 61 20 70 65 72 6b 65  6c 65                    |na perkele|
 $ cat /tmp/1f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
 $ cat /tmp/2f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
 $

########################### end of log copy.

HOWEVER, I have a couple of issues here.

First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!

Now, the standard says that you are not allowed to have illegal octets or characters in a UTF file, of any width.

Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!!

Further, it seems that write puts the BOM before every string. That is definitely illegal.

(The operating system let me "cat" the files to the screen, and tried its best to show them in a reasonable way, as you see above. But it really would not have had to.)

---

What we could have happen instead is that the first string output to the stream causes the stream to choose its UTF width (and theoretically the endianness, too). (This is what the OS does when choosing whether to open in byte width or wider, according to Linux documentation.) And whenever somebody tries to stuff "the wrong" crap in there, do either of the following (see the sketch after this list):

 - implicitly convert the string to the right UTF
 - throw error
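
Something like this, say. UtfTextStream is a name I just made up; toUTF8 and toUTF16 are the real std.utf conversion functions:

import std.stream;
import std.utf;

// Sketch only: locks the stream's UTF width on the first string written,
// and converts later strings to match instead of letting two encodings
// mix in one file. (Throwing would be the other option.)
class UtfTextStream
{
    private Stream sink;
    private int width = 0;   // 0 = undecided, 1 = UTF-8, 2 = UTF-16

    this(Stream sink) { this.sink = sink; }

    void write(char[] s)
    {
        if (width == 0) width = 1;              // first write decides
        if (width == 1)
            sink.writeExact(s.ptr, s.length);
        else
        {
            wchar[] w = toUTF16(s);             // implicit conversion
            sink.writeExact(w.ptr, w.length * wchar.sizeof);
        }
    }

    void write(wchar[] s)
    {
        if (width == 0) width = 2;              // first write decides
        if (width == 2)
            sink.writeExact(s.ptr, s.length * wchar.sizeof);
        else
        {
            char[] c = toUTF8(s);               // implicit conversion
            sink.writeExact(c.ptr, c.length);
        }
    }
}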

---

While D is in pre-1.0, I think we should at first decide that streams have to be opened with the UTF encoding specified. Since the compiler should know the type of all the strings (see my other post today), it can then insert code for the appropriate runtime conversion.

Since the compiler knows the type of each string, it might be suggested that the first output string should define the stream type.

I think that would be unwise. But _only_ for the same reason that D demands a default case in a switch, denies a semicolon right after an if clause, etc.

That is, to help the programmer not shoot himself in the foot. There is _no_ valid reason why it couldn't be set by the first string automatically.

HOWEVER, good table manners ask for reasonable defaults wherever at all possible. Such a default would be the UTF width and endianness that are "natural" on the particular platform, as sketched below.

(If D is ever ported to a platform that doesn't handle UTF, then the Natural Default of course is None. That is, one has to choose manually when opening the stream.)
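
Something like this could pick the default; defaultUtfWidth is just a made-up name, but the facts are plain enough (Windows is natively UTF-16, the Unix convention is UTF-8):

// Made-up compile-time default: the UTF width that is "natural" here.
version (Windows)
    const int defaultUtfWidth = 2;   // Windows APIs are natively UTF-16
else version (linux)
    const int defaultUtfWidth = 1;   // the Unix convention is UTF-8
else
    const int defaultUtfWidth = 0;   // "None": must be chosen manually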

---

Similarly, if we want to implement our INPUT streams correctly, then they should _definitely_ choose their UTF type before the first time the application gets to read from the stream.

FOR THE SITUATIONS where one has to process the first octet before enough of the stream has been seen to know which UTF type it is, THEN in THAT CASE an input stream of e.g. ubyte should be mandatory to use instead. Or, more to the point, UTF streams should not be used then. (A sketch of the pre-read decision follows.)
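
Something like this could make the decision before the app reads anything. sniffUtfWidth is a made-up helper; read and seekSet are real std.stream calls, and the BOM patterns are from the Unicode FAQ:

import std.stream;

// Made-up helper: peek at the first octets and decide the UTF type
// before the application gets to read anything. Returns 1, 2 or 4 for
// UTF-8/16/32, or 0 when there is no BOM and the caller must fall back
// to a plain ubyte stream.
int sniffUtfWidth(Stream s)
{
    ubyte[4] head;
    size_t n = s.read(head);    // four octets is enough for any BOM
    s.seekSet(0);               // rewind: the app must see the whole stream

    // UTF-32 BOMs must be tested first, since FF FE is their prefix too.
    if (n >= 4 && head[0] == 0xFF && head[1] == 0xFE
               && head[2] == 0x00 && head[3] == 0x00) return 4; // UTF-32LE
    if (n >= 4 && head[0] == 0x00 && head[1] == 0x00
               && head[2] == 0xFE && head[3] == 0xFF) return 4; // UTF-32BE
    if (n >= 3 && head[0] == 0xEF && head[1] == 0xBB
               && head[2] == 0xBF) return 1;                    // UTF-8
    if (n >= 2 && head[0] == 0xFF && head[1] == 0xFE) return 2; // UTF-16LE
    if (n >= 2 && head[0] == 0xFE && head[1] == 0xFF) return 2; // UTF-16BE
    return 0;
}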

---

I have to remark on "since the compiler knows the type of string" above. Since this is such rocket science, DO REMEMBER that it "knows" because it looks at the TYPE (as in char[], wchar[], dchar[]) and not at the CONTENTS of the string at that time.

:-) Just to keep apples and oranges in order...

---

Incidentally, what I called the BOM above does not look like a BOM should, in the above file dumps anyway.

---

Before we continue, I think everybody should read the following:

www.unicode.org/faq/

                            -- ** --



November 18, 2005
Georg Wrede wrote:
<snip>
>  $ cat /tmp/1f.txt
> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>  $ cat /tmp/2f.txt
> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>  $
> 
> ########################### end of log copy.
> 
> HOWEVER, I have a couple of issues here.

I think you have some serious issues with the political correctness of the message here ;)
November 18, 2005
Georg Wrede wrote:
<snip>
> HOWEVER, I have a couple of issues here.
> 
> First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!

write(...) writes the source value to the stream byte by byte.

> 
> Now, the standard says that you are not allowed to have illegal octets, or characters in a UTF file. Of any width.
> 
> Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!!
> 
> Further, seems like write puts the BOM before every string. That is definitely illegal.
> 

That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream; it doesn't care about UTF.

> What we could have happen is, that the first string output to the stream, causes the stream to choose the stream UTF width (and theoretically the endianness, too). (This is what the OS does when choosing whether to open in byte width or wider, according to linux documentation.) And whenever somebody tries to stuff "the wrong" crap there, do either of the following:
> 
>  - implicitly convert the string to the right UTF
>  - throw error

I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?
November 18, 2005
Jari-Matti Mäkelä wrote:
> Georg Wrede wrote:
> <snip>
> 
>>  $ cat /tmp/1f.txt
>> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>>  $ cat /tmp/2f.txt
>> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>>  $
>>
>> ########################### end of log copy.
>>
>> HOWEVER, I have a couple of issues here.
> 
> 
> I think you have some serious issues with the political correctness of the message here ;)

ROFL !

I trust the non-Finns use inborn Duck Typing. If it doesn't look like obscenities, then it isn't.   :-)

Or maybe I have an encryptor that turns "Hail Mary" into that string.

Or maybe repeatedly drawing a picture out of iron wire has my hands bleeding, and I'm getting pissed off here. Maybe I should switch to clay models...

But hey, it was USASCII all over!
November 18, 2005
Jari-Matti Mäkelä wrote:
> Georg Wrede wrote:
> <snip>
> 
>> HOWEVER, I have a couple of issues here.
>>
>> First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!
> 
> 
> write(...) writes the source value to the stream byte by byte.

Oops.

Well, in that case, we should give it ubyte[] when we don't want fanciness. Or void[], right!

Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR we should have different kinds of streams. Some of which would be UTF-savvy, some text, some void streams.

>> Now, the standard says that you are not allowed to have illegal octets, or characters in a UTF file. Of any width.
>>
>> Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!!
>>
>> Further, seems like write puts the BOM before every string. That is definitely illegal.
> 
> That is illegal if you're trying to create a valid _text_ file. AFAIK
> the normal File is just a regular OutputStream, it doesn't care about
> UTF.

We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!

>> What we could have happen is, that the first string output to the stream, causes the stream to choose the stream UTF width (and theoretically the endianness, too). (This is what the OS does when choosing whether to open in byte width or wider, according to linux documentation.) And whenever somebody tries to stuff "the wrong" crap there, do either of the following:
>>
>>  - implicitly convert the string to the right UTF
>>  - throw error
> 
> I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?

Of course!
November 18, 2005
Georg Wrede wrote:
> Further, seems like write puts the BOM before every string. That is definitely illegal.
> 
...
> 
> What I called BOM above, does incidentally not look like it should, in the above file dumps anyway.
> 
Because that's not the BOM, it's (an int with) the string length... (0f 00 00 00 is just 15 little-endian, the length of "Saatana perkele".)

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 23, 2005
Georg Wrede wrote:
> Jari-Matti Mäkelä wrote:
>> Georg Wrede wrote: <snip>
>> 
>>> HOWEVER, I have a couple of issues here.
>>> 
>>> First of all, it looks like we don't have implicit conversion,
>>> but rather that the strings get copied to the output stream byte
>>> by byte!
>> 
>> write(...) writes the source value to the stream byte by byte.
> 
> Oops.
> 
> Well, in that case, we should give it ubyte[] when we don't want fanciness. Or void[], right!
> 
> Which should make it EITHER illegal to write [c/w/d]char[] to it --
> OR we should have different kinds of streams. Some of which would be
> UTF-savvy, some text, some void streams.
> 
>>> Now, the standard says that you are not allowed to have illegal octets, or characters in a UTF file. Of any width.
>>> 
>>> Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the
>>> same file!!!!!
>>> 
>>> Further, seems like write puts the BOM before every string. That
>>> is definitely illegal.
>> 
>> That is illegal if you're trying to create a valid _text_ file.
>> AFAIK the normal File is just a regular OutputStream, it doesn't
>> care about UTF.
> 
> We should have a set of different streams. Hey, Java has like
> millions to choose from! You can even join them to get, say, a
> "buffered, character-code-translating, rot-13, foo-izing" stream!!!

At first, the Java style where one chains streams seemed terribly inefficient. But later I understood that it wasn't; it just looked inefficient. We could have raw input and output streams, and then a set of conversion streams (or actually filters), like this:

OutStream os = new OutStream("foo");                  // opens a raw outstream
StreamBuffer sb = new StreamBuffer(os);               // adds buffering on top
ConvStream cs = new ConvStream(UTF8, ISO8859_15, sb); // converts between encodings
...
char[] mytext = "kjsldkfjlskdfjslkd";
fwritefln(cs, mytext);

Since StreamBuffer eventually outputs everything, one doesn't even have to worry about the buffer getting filled up "mid-character" when doing UTF output (not the case in the example above), since the rest of the character gets output later anyhow.

I think this looks clean and easy to maintain (for the library maintainer), and its use is straightforward, flexible, and conceptually clear.

This would also bring tighter locality to the whole input/output system, since every stream only does its own thing.

With this setup it also becomes much easier for the programmer to write his own stream filters, without having to become a Stream Guru first.
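
For instance, a rot13 filter in this style could be about this small. Rot13Filter is made up, of course; writeExact is std.stream's:

import std.stream;

// Made-up example of a user-written filter: rot13 everything written
// through it, then pass the result on to the next stream in the chain.
class Rot13Filter
{
    private Stream sink;

    this(Stream sink) { this.sink = sink; }

    void write(ubyte[] data)
    {
        ubyte[] enc = data.dup;   // leave the caller's buffer alone
        for (size_t i = 0; i < enc.length; i++)
        {
            ubyte b = enc[i];
            if (b >= 'a' && b <= 'z')
                enc[i] = cast(ubyte)((b - 'a' + 13) % 26 + 'a');
            else if (b >= 'A' && b <= 'Z')
                enc[i] = cast(ubyte)((b - 'A' + 13) % 26 + 'A');
        }
        sink.writeExact(enc.ptr, enc.length);
    }
}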