Non UTF characters in comments (page 2)

Thomas Kühne wrote:

> | It *should* just stop dead when it finds one, e.g.:
> | error("invalid UTF-8 sequence");
>
> I don't think the compiler should try to check the comment's content.

Why not ? It checks the rest of the file...

> What is an "invalid" UTF-8 sequence?

I just think it should treat comments the
same way it treats identifiers and literals ?

That is, call: utf_decodeChar and follow
whatever error that it returns... (utf.c)

--anders

I see. Thanks, Sebastian.

But text with erroneous UTF sequences (not "non UTF character", sic!)
will not be compiled anyway.

So for
"... other things like text/code editors. What is supposed to happen when a non UTF character is encountered?..."
the editor (a good one) should mark them as "bad string literal" or the like.

Andrew Fedoniouk.
http://terrainformatica.com

"Sebastian Beschke" <s.beschke@gmx.de> wrote in message news:ctgkta$48a$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> What does "non UTF character" mean?
>> UTFs are the forms of representing/encoding the full UNICODE table - 21-bit
>> characters (code points).
>>
>> So "non UTF character" sounds to me like "non UNICODE character". And what
>> is that?
>> Some new alphabet?
>
> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by misforming UTF-8 or UTF-16 sequences.
>
> -Sebastian

Andrew Fedoniouk wrote:

> But text with erroneous utf sequences (not "non UTF character", sic! ) will not be compiled anyway.

A bug in the current DMD makes it allow almost everything, in comments.

--anders

January 29, 2005

Re: Non UTF characters in comments

Posted by Thomas Kühne
in reply to Anders F Björklund

Anders F Björklund wrote:

| Thomas Kühne wrote:
|
|> | It *should* just stop dead when it finds one, e.g.:
|> | error("invalid UTF-8 sequence");
|>
|> I don't think the compiler should try to check the comment's content.
|
|
| Why not ? It checks the rest of the file...
|
|> What is an "invalid" UTF-8 sequence?
|
| I just think it should treat comments the
| same way it treats identifiers and literals ?
|
| That is, call: utf_decodeChar and follow
| whatever error that it returns... (utf.c)

The current checks for identifiers are:
1) shortest possible byte sequence for UTF-8
OK

2) no lone surrogate part
That might clash with pre-1.5 Java output.
This is a Java bug, thus can be ignored.

3) c <= 0x10FFFF
OK

4) c != 0xFFFE && c != 0xFFFF
That's the only check I reject. Those codepoints can occur if a
non-Unicode document is converted to UTF-encoded Unicode. Inside of
comments they shouldn't stop the parsing.

Those checks above are - except for the 4th - reasonable for comments.

Thomas
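
The four checks listed in this post can be sketched as a small standalone C decoder. This is only an illustration under stated assumptions: the function name utf8_decode_checked, the status enum, and the in_comment flag are hypothetical and are not taken from dmd/utf.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; these names do not come from dmd/utf.c. */
enum utf8_status {
    UTF8_OK = 0,
    UTF8_TRUNCATED,     /* sequence runs past the end of the buffer */
    UTF8_BAD_BYTE,      /* invalid lead byte or continuation byte */
    UTF8_OVERLONG,      /* check 1: not the shortest possible form */
    UTF8_SURROGATE,     /* check 2: surrogate U+D800..U+DFFF */
    UTF8_OUT_OF_RANGE,  /* check 3: c > 0x10FFFF */
    UTF8_NONCHARACTER   /* check 4: U+FFFE or U+FFFF */
};

/* Decode one code point starting at s[*idx]. On success, store it in *out
   and advance *idx. When in_comment is nonzero, check 4 is skipped. */
enum utf8_status utf8_decode_checked(const unsigned char *s, size_t len,
                                     size_t *idx, uint32_t *out,
                                     int in_comment)
{
    /* Smallest code point that needs a sequence of n bytes (index = n). */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t i = *idx;
    size_t n;
    uint32_t c;

    if (i >= len)
        return UTF8_TRUNCATED;

    unsigned char b = s[i];
    if (b < 0x80)                { n = 1; c = b; }
    else if ((b & 0xE0) == 0xC0) { n = 2; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { n = 3; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { n = 4; c = b & 0x07; }
    else
        return UTF8_BAD_BYTE;

    if (len - i < n)
        return UTF8_TRUNCATED;

    for (size_t k = 1; k < n; k++) {
        unsigned char t = s[i + k];
        if ((t & 0xC0) != 0x80)
            return UTF8_BAD_BYTE;
        c = (c << 6) | (uint32_t)(t & 0x3F);
    }

    if (c < min_for_len[n])
        return UTF8_OVERLONG;                       /* check 1 */
    if (c >= 0xD800 && c <= 0xDFFF)
        return UTF8_SURROGATE;                      /* check 2 */
    if (c > 0x10FFFF)
        return UTF8_OUT_OF_RANGE;                   /* check 3 */
    if (!in_comment && (c == 0xFFFE || c == 0xFFFF))
        return UTF8_NONCHARACTER;                   /* check 4 */

    *out = c;
    *idx = i + n;
    return UTF8_OK;
}
```

With in_comment set to 1 the bytes EF BF BE (U+FFFE) decode successfully, while checks 1 to 3 still apply, which matches the position argued in the post above.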

Thomas Kühne wrote:

> 4) c != 0xFFFE && c != 0xFFFF
> That's the only check I reject. Those codepoints can occur if a
> non-Unicode document is converted to UTF encoded Unicode. Inside of
> comments they shouldn't stop the parsing.
>
> Those checks above are - except for the 4th - reasonable for comments.

If needed, that can be hacked around for comments, for those two.

> s = utf_decodeChar(octet, ndigits, &idx, &c);
> if (s || idx != ndigits)

can be changed into:

s = utf_decodeChar(octet, ndigits, &idx, &c);
if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)

Would that make it more reasonable ? (have the DMD patch ready...)

--anders
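
The changed condition in this post can be isolated as a predicate to make its behaviour easier to see. This is a sketch only: comment_char_is_error is a hypothetical helper, not DMD code, and it assumes the decoder stores the code point in c and advances idx even when it returns an error string, which the thread does not confirm.

```c
#include <stddef.h>

/* Hypothetical helper, not part of DMD. Returns nonzero when a character
   found inside a comment should be reported as an error.
   err     - error string from the decoder, or NULL on success
   c       - decoded code point (assumed to be set even on error)
   idx     - position after decoding
   ndigits - number of units the caller expected to be consumed */
int comment_char_is_error(const char *err, unsigned long c,
                          size_t idx, size_t ndigits)
{
    return (err && c != 0xFFFE && c != 0xFFFF) || idx != ndigits;
}
```

Under that assumption, an error reported for U+FFFE or U+FFFF is swallowed, while every other decoder error and any length mismatch is still reported.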

I wrote:

>> Looks like DMD allows that in comments and I don't think it's a good
>> idea. [...]

> Would that make it more reasonable ? (have the DMD patch ready...)

Hilarious, the new patch made phobos fail:

> ../gcc-3.4.3/gcc/d/phobos/std/loader.d:62: invalid UTF-8 sequence

Due to this little comment line, from GDC:

> Modified by David Friedman, October 2004 (applied patches from Anders F Björklund.)

(as the ö here was in Latin-1, you see...)

--anders
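
To illustrate why a Latin-1 "ö" trips a UTF-8 check: in Latin-1 it is the single byte 0xF6, which looks like a UTF-8 lead byte but is followed by the plain ASCII letter "r" instead of a continuation byte. The sketch below uses two hypothetical helpers (looks_like_utf8 and latin1_to_utf8, neither of which is from DMD or GDC) to show the failure and the re-encoding that would fix such a file.

```c
#include <stddef.h>

/* Structural UTF-8 check only: lead byte class plus continuation bytes.
   It does not test for overlong forms, surrogates, or range. */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;
        if (b < 0x80)                n = 1;
        else if ((b & 0xE0) == 0xC0) n = 2;
        else if ((b & 0xF0) == 0xE0) n = 3;
        else if ((b & 0xF8) == 0xF0) n = 4;
        else return 0;                       /* not a valid lead byte */
        if (len - i < n)
            return 0;                        /* truncated sequence */
        for (size_t k = 1; k < n; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                    /* missing continuation byte */
        i += n;
    }
    return 1;
}

/* Re-encode Latin-1 as UTF-8. dst must have room for 2 * len bytes.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t len,
                      unsigned char *dst)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[o++] = b;                               /* ASCII as-is */
        } else {
            dst[o++] = (unsigned char)(0xC0 | (b >> 6));   /* lead byte */
            dst[o++] = (unsigned char)(0x80 | (b & 0x3F)); /* continuation */
        }
    }
    return o;
}
```

For the name in the comment line above, the Latin-1 byte 0xF6 becomes the two bytes C3 B6 after conversion, and the result passes the structural check.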

Anders F Björklund wrote:

| Thomas Kühne wrote:
|
|> 4) c != 0xFFFE && c != 0xFFFF
|> That's the only check I reject. Those codepoints can occur if a
|> non-Unicode document is converted to UTF encoded Unicode. Inside of
|> comments they shouldn't stop the parsing.
|>
|> Those checks above are - except for the 4th - reasonable for
|> comments.
|
| If needed, that can be hacked around for comments, for those two.
|
|> s = utf_decodeChar(octet, ndigits, &idx, &c);
|> if (s || idx != ndigits)
|
| can be changed into:
|
| s = utf_decodeChar(octet, ndigits, &idx, &c);
| if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
|
| Would that make it more reasonable ? (have the DMD patch ready...)

Have a look at utf_decodeChar:
dmd/utf.c:92 and dmd/utf.c:183 ;)

While looking through utf.c I noticed that UTF-32 decoding doesn't undergo
any checks. I'll write a bunch of test cases for all those encoding issues
tomorrow.

Thomas
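
As an illustration of what checks on UTF-32 input could look like, here is a minimal sketch that applies the range and surrogate checks from earlier in the thread to each 32-bit unit. The function names are hypothetical and this is not the code in dmd/utf.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a UTF-32 unit is acceptable if it lies within the
   Unicode range and is not a surrogate code point. */
int utf32_unit_is_valid(uint32_t c)
{
    if (c > 0x10FFFF)
        return 0;                   /* beyond the Unicode range */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                   /* surrogates are not valid in UTF-32 */
    return 1;
}

/* Returns the index of the first invalid unit, or len if all are valid. */
size_t utf32_first_invalid(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (!utf32_unit_is_valid(s[i]))
            return i;
    return len;
}
```

Whether U+FFFE and U+FFFF should also be rejected here is the same open question as check 4 for UTF-8, so this sketch leaves them alone.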

> A bug in the current DMD makes it allow almost everything, in comments.

It's a feature rather than a bug.
Preparation for attributed programming I guess.
With option to include binary data inline :)
I can imagine properties/methods having their own descriptive GIFs given in source text as bytes.
Le Cauchemar!

BTW: Are there any ports of png/jpeg/gif libs in D?

Andrew Fedoniouk.
http://terrainformatica.com

Thomas Kühne wrote:

> | can be changed into:
> |
> | s = utf_decodeChar(octet, ndigits, &idx, &c);
> | if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
> |
> | Would that make it more reasonable ? (have the DMD patch ready...)
>
> Have a look at utf_decodeChar:
> dmd/utf.c:92 and dmd/utf.c:183 ;)

Yes, the idea is that it will not be valid and return
string "invalid UTF-8 sequence", which is then ignored
because the char is FFFE/F... (all input is converted
to UTF-8 before lexer)

The patch is in the digitalmars.D.bugs group.

--anders

Andrew Fedoniouk wrote:

> Preparation for attributed programming I guess. With option to include binary data inline :)

:-) No, it's a bug. D source code is supposed to be valid UTF-8/16/32.

Ideally, the HTML used should be made to be valid XHTML as well...

--anders
